Abstract

Long branch delay is a well-known problem in today's high per- 
formance superscalar and superpipeline processor designs. A com- 
mon technique used to alleviate this problem is to predict the direc- 
tion of branches during the instruction fetch. Counter-based branch 
prediction, in particular, has been reported as an effective scheme for 
predicting the direction of branches. However, its accuracy is gener- 
ally limited by branches whose future behavior is also dependent 
upon the history of other branches. To enhance branch prediction ac- 
curacy with a minimum increase in hardware cost, we propose a cor- 
relation-based scheme and show how the prediction accuracy can be 
improved by incorporating information, not only from the history of 
a specific branch, but also from the history of other branches. Specif- 
ically, we use the information provided by a proper subhistory of a 
branch to predict the outcome of that branch. The proper subhistory 
is selected based on the outcomes of the most recently executed M 
branches. The new scheme is evaluated using traces collected from 
running the SPEC benchmark suite on an IBM RISC System/6000 
workstation. The results show that, as compared with the 2-bit
counter-based prediction scheme, the correlation-based branch
prediction achieves up to 11% additional accuracy at the extra hardware
cost of one shift register. The results also show that the accuracy of 
the new scheme surpasses that of the counter-based branch predic- 
tion at saturation. 

1. Introduction 

Recent advances in RISC architectures and VLSI technologies
allow computer designers to exploit more instruction-level parallelism
with deeper pipelines and more concurrent functional units [1, 2]. As
sophisticated processors are built to exploit the available
instruction-level parallelism, more attention needs to be paid to the
disruption of pipeline flow as a result of branch instruction execution
[8]. Pipeline disruption reduces the effective instruction throughput
by introducing extra delays in the pipeline. Since branches constitute
a large portion of all the executed instructions, the efficiency of
handling branches is important. Our primary interest is in reducing the
branch penalty incurred in executing conditional branches. All
branches mentioned below, unless otherwise stated, are conditional
branches.

Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the ACM copyright notice and the
title of the publication and its date appear, and notice is given
that copying is by permission of the Association for Computing
Machinery. To copy otherwise, or to republish, requires a fee
and/or specific permission.
ASPLOS V - 10/92/MA, USA

© 1992 ACM 0-89791-535-6/92/0010/0076...$1.50

Almost all the branch cost reduction techniques reported in the 
literature require the use of some mechanism for predicting the out- 
come of branches. Other than the profiling technique [3, 5], all pre- 
diction schemes require hardware assistance. Hardware-assisted 
branch predictions typically fall into two categories: static and
dynamic. Overviews of these schemes can be found in [4, 7]. Generally,
dynamic prediction gives better results than static prediction, but at 
the cost of increased hardware complexity. A less-complex yet rea- 
sonably effective scheme is the N-bit counter scheme [3, 4, 7]. In 
this scheme, the prediction of the outcome of a branch is based on the 
output of a finite-state machine whose state is recorded in an N-bit 
up/down counter. The counter is incremented or decremented ac- 
cording to whether the branch is taken or not. We refer to this scheme 
as the counter-based branch prediction. Later we will show its oper- 
ation in more detail. 

A common limitation of most dynamic branch prediction schemes is
that the prediction is based on "self-history". Specifically,
the prediction is based exclusively on the past history of the
branch under consideration, completely ignoring the information 
provided by the executions of other branches. Self-history predic- 
tion schemes generally work well for scientific/engineering applica- 
tions where program execution is dominated by inner-loops. How- 
ever, in many integer workloads, control-flows are complex and 
very often the outcome of a branch is affected by the outcomes of re- 
cently executed branches. In other words, the branches are corre- 
lated. Because of the correlation, the history of a branch, considered 
by itself, is very chaotic and that reduces the accuracy of self-history 
prediction schemes. A prior study shows that branch correlation
does take place in programs and that its source can be traced back to
common high-level language constructs [6]. The Appendix summarizes
some of our observations of the source code-level branch correlation
that appears in the SPEC integer benchmarks.

Contrary to the self-history based approach, the "two-level 
adaptive training branch prediction" reported recently uses the 
global branch history pattern associated with each branch address for 
predicting the outcome of the branch [10]. The same global history 
pattern results in the same prediction, regardless of which branch ad- 
dress the history pattern is associated with. Although this approach 
is reported to produce a fairly high prediction accuracy [10], its
hardware implementation seems quite complicated.

To our knowledge, very little work has been done in addressing 
the issue of branch correlation in branch prediction. In this paper, we 
study the effect of branch correlation in branch prediction and pro- 
pose a correlation-based prediction scheme which also produces 
high prediction accuracy. The proposed branch prediction scheme 
is simple to implement and its implementation is very similar to that 
of the counter-based branch prediction. 

The new scheme is evaluated using traces collected from running 
the SPEC benchmark suite [9] on an IBM RISC System/6000 work- 
station. The results show that, as compared with the 2-bit counter- 
based prediction scheme, the correlation-based branch prediction 
achieves up to 11% additional accuracy at the extra hardware cost of
one shift register. The results also show that the accuracy of the new 
scheme surpasses that of the counter-based branch prediction at sat- 
uration. 

The remainder of this paper is organized as follows: In section 
2, the correlation-based branch prediction scheme is introduced with 
an example. A brief description of the counter-based branch predic- 
tion is also given. In section 3, simulation results evaluating the new 
scheme are presented. In section 4, we give the main conclusions. 

2. Dynamic Branch Prediction 

In this section, we will describe the N-bit counter scheme and 
introduce a new prediction scheme based on branch correlation. An 
example will be given to explain the difference between these two 
schemes. Finally, the implementation of the new scheme will be dis- 
cussed. 

2.1 Counter-Based Branch Prediction 

The basic idea for the counter-based branch prediction is to use 
an N-bit up/down counter [3, 4, 7] for prediction. In the ideal case,
an N-bit counter (with some initial value) is assigned to each static 
branch (branches with distinct addresses). When a branch is about 
to be executed, the counter value C, associated with that branch, is 
used for prediction. If C is greater than or equal to a predetermined
threshold value L, the branch is predicted taken; otherwise it is
predicted not taken. A typical value for L is $2^{N-1}$. The counter
value C is updated whenever that branch is resolved. If the branch is
taken, C is incremented by one; otherwise it is decremented by one. If
C is $2^N - 1$, it remains at that value as long as the branch is
taken. If C is 0, it remains at zero as long as the branch is not taken.

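[Fig. 1: state diagram of the 2-bit counter FSM. States below the
threshold L=2 predict not taken; states at or above it predict taken.
Transitions are labeled with the actual result: 1=taken; 0=not taken.]
Fig. 1 FSM for the 2-bit Counter Scheme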

The operation of the N-bit counter scheme corresponds to a
finite-state machine (FSM) with $2^N$ states. Fig. 1 shows the FSM with
N=2 and L=2. Smith [7] reported that a counter of 2 bits is usually
as good as or better than other strategies and that a larger counter
size does not necessarily give better results.
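
The counter rules described in this section can be written down in a few lines of C. What follows is only a minimal sketch under stated assumptions: the type `counter_t` and the functions `counter_make`, `counter_predict`, and `counter_update` are hypothetical names introduced here for illustration, and the threshold is fixed at the typical value $L = 2^{N-1}$ mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* One N-bit up/down saturating counter (names are illustrative only). */
typedef struct {
    unsigned value;   /* current count C, in the range 0 .. 2^N - 1 */
    unsigned max;     /* saturation value 2^N - 1 */
    unsigned thresh;  /* threshold L, assumed here to be 2^(N-1) */
} counter_t;

/* Build an n_bits-wide counter that starts at `initial`. */
static counter_t counter_make(unsigned n_bits, unsigned initial)
{
    counter_t c;
    assert(n_bits >= 1u && n_bits <= 16u);
    c.max = (1u << n_bits) - 1u;
    c.thresh = 1u << (n_bits - 1u);
    assert(initial <= c.max);
    c.value = initial;
    return c;
}

/* Predict taken when C >= L, not taken otherwise. */
static bool counter_predict(const counter_t *c)
{
    return c->value >= c->thresh;
}

/* Once the branch is resolved: count up on taken, down on not taken,
 * holding the value at 2^N - 1 and at 0 respectively. */
static void counter_update(counter_t *c, bool taken)
{
    if (taken) {
        if (c->value < c->max) c->value++;
    } else {
        if (c->value > 0u) c->value--;
    }
}
```

With `counter_make(2, 0)` this behaves like the four-state FSM of the 2-bit scheme: two consecutive taken outcomes are needed before the prediction flips from not taken to taken.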

2.2 Correlation-Based Branch Prediction 

Most studies of dynamic branch prediction focus on the history 
of the branch under consideration [4, 7]. With hardware-assisted
branch prediction, only the most recent history of a branch is used to 
predict the outcome of that branch. These branch prediction schemes 
work well for scientific/engineering workloads where program ex- 
ecution is dominated by inner-loops. However, they do not work as 
well for integer workloads where the outcome of a branch is affected 
by the outcomes of recently executed branches. When one branch 
depends on another, in the sense that its outcome depends on the out- 
come of the other branch, we say that the branches are correlated. 

As an illustration of branch correlation, consider the code frag- 
ment shown in Fig. 2: 



    if (aa == 2)        /* b1 */
        aa = 0;
    if (bb == 2)        /* b2 */
        bb = 0;
    if (aa != bb) {     /* b3 */
        ...
    }

Fig. 2 A Code Fragment from SPEC Benchmark eqntott

This code fragment (other than the comments) appears in a frequently
executed block of the SPEC integer benchmark eqntott. There are
three if-statements in this code fragment. Assume that the
if-statements are converted by a compiler to three branch instructions
b1, b2, and b3, and the action determined by each if-statement is the
branch "fall-through path", meaning that the branch "taken" path is the
path for which the condition is not true. Since the outcome of b3
depends on the values of aa and bb, it is obvious that b3 is correlated
with b1 and b2.
[Fig. 3: branch tree from b1 through b2 to b3, with the four paths
labeled by the outcomes of b1 and b2. path: A: 0-0 B: 0-1 C: 1-0 D: 1-1]
Fig. 3 Branch Tree for the Code Fragment Given in Fig. 2

Although the presence of branch correlation may cause the behavior
of a branch to appear more random, it may shed some light on
the condition upon which the branch decision is based. Consider
again the same example given in Fig. 2. After the executions of b1
and b2, the condition that b3 is dependent upon is already partially
known. Fig. 3 shows the part of the branch tree before the execution
of b3 given that b1 and b2 have been executed.

There are four possible paths reaching b3 through the executions
of b1 and b2. For example, if b1 is taken and b2 is not taken, then b3
is reached via the 1-0 path (path C in Fig. 3). Fig. 4 shows the
information available at b3, given that b1 and b2 have been executed.
It is clear that if b3 is reached via the 0-0 path, the outcome of b3
can be determined prior to its execution. But this situation cannot be
exploited by the conventional self-history based prediction schemes.
This example suggests that the outcome of a branch can be more
readily determined if the path leading to it is known. By splitting the
branch history of b3 into four subhistories according to the paths
leading to b3, one may reduce the randomness of the apparent behavior
of b3 and thus make a better prediction.
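
To make the path argument concrete, the sketch below runs the three if-statements of Fig. 2 for given input values and reports which path reaches b3 together with the outcome of b3. It is an illustration only: `fragment_result_t` and `run_fragment` are hypothetical names, and the encoding follows the convention stated above that a branch is "taken" (1) when its condition is not true.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned path;    /* (b1 << 1) | b2: 0 = A (0-0), 1 = B (0-1),
                         2 = C (1-0), 3 = D (1-1) */
    bool b3_taken;    /* outcome of b3 */
} fragment_result_t;

/* Execute the fragment on (aa, bb), recording each branch outcome.
 * A branch counts as taken when its if-condition is false. */
static fragment_result_t run_fragment(int aa, int bb)
{
    fragment_result_t r;
    bool b1_taken = !(aa == 2);
    bool b2_taken;

    if (aa == 2)
        aa = 0;
    b2_taken = !(bb == 2);
    if (bb == 2)
        bb = 0;

    r.path = ((unsigned)b1_taken << 1) | (unsigned)b2_taken;
    assert(r.path < 4u);          /* one of the four paths A..D */
    r.b3_taken = !(aa != bb);     /* b3 tests aa != bb */
    return r;
}
```

On path A (both conditions true) the fragment forces aa and bb to 0, so b3 is always taken; on path D the outcome still depends on the data, which is why that subhistory remains less predictable.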

path leading to b3:   A: 0-0    B: 0-1    C: 1-0    D: 1-1
                      aa=0      aa=0      aa≠2      aa≠2
                      bb=0      bb≠2      bb=0      bb≠2

Fig. 4 Information About aa, bb Available at b3
After b1 and b2 Have Been Executed

Let's further examine the example with data that are arbitrarily
chosen only to reflect the branch correlation. Suppose that we run
the code fragment given in Fig. 2 on a machine which implements
the 2-bit counter scheme shown in Fig. 1 with initial state set to 0.
Table 1 shows the predicted outcomes of b3 and the state transitions.
The first two columns show the initial values of aa and bb before the
execution of b1. Columns aa' and bb' in the table show the new values
of aa and bb after b1 and b2 are executed. Column "path" indicates
the path from which b3 is reached. Column "curr state" shows
the current state of the FSM. Column "pred" shows the predicted
outcome of b3. The actual outcome is given in column "act". Column
"c/w" indicates the correct (c) or wrong (w) prediction. The
state is updated according to the current state and the actual outcome.
The updated state is shown in the column under "next state".

Table 1 State Transitions and Branch Predictions for b3
Using 2-bit Counter-Based Prediction Scheme
N=not taken, T=taken, c=correct pred, w=wrong pred.



A careful inspection of the table reveals that the apparently
random branch history of b3 (column "act") is actually formed by
interweaving four less random branch subhistories, each of which is
associated with a branch path leading to b3 (compare columns "path" and
"act"). After splitting the branch history of b3 according to the four
branch paths shown in Fig. 5, one can obtain the four branch
subhistories of b3.

time →
Branch paths:  C A B B A D D B D D C C A B D A C D D A
Outcomes:      T T N T T N N T N T N N T T N T T N N T

path A: TTTTT    path B: NTTT    path C: TNNT    path D: NNNTNNN

Fig. 5 Subhistories Obtained by Splitting the
History of b3 According to the Branch Paths

It is evident from Fig. 5 that the outcomes of b3 are less random
within each subhistory. Hence better predictions are expected if we
independently implement a 2-bit counter for each subhistory. In fact,
only 3 out of the 20 executions of b3 are correctly predicted if only
one 2-bit counter (with initial state equal to 0) is used. However, if
four 2-bit counters are used (all initialized to 0), with one for each
subhistory, 10 additional correct predictions can be obtained. Note
that the state transition and the state update of the FSM associated
with each counter are local to each branch path. This is shown in Fig.
6. Notice that we are not suggesting the use of four counters for every
branch. We are merely showing that taking the path leading to a
branch into consideration leads to a better prediction. Later we will
show an implementation scheme that exploits the correlation
between branches without increasing the overall number of counters
used to track the history of branches.
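
The two counts quoted above (3 correct predictions with one counter, 13 with four) can be checked by replaying the path and outcome sequences of Fig. 5. The following is a small sketch for that purpose; `PATHS`, `OUTCOMES`, and `count_correct` are names introduced here, and the sequences are transcribed from the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Path labels and b3 outcomes in execution order, as given in Fig. 5. */
static const char PATHS[]    = "CABBADDBDDCCABDACDDA";
static const char OUTCOMES[] = "TTNTTNNTNTNNTTNTTNNT";

/* Replay the history with 2-bit counters (threshold 2, initial state 0)
 * and return the number of correct predictions.
 * split_by_path == 0: a single counter is shared by all executions.
 * split_by_path != 0: each of the paths A..D has its own counter. */
static int count_correct(int split_by_path)
{
    unsigned counters[4] = {0u, 0u, 0u, 0u};
    size_t n = strlen(PATHS);
    size_t i;
    int correct = 0;

    assert(n == strlen(OUTCOMES));
    for (i = 0; i < n; i++) {
        unsigned which = split_by_path ? (unsigned)(PATHS[i] - 'A') : 0u;
        unsigned *c = &counters[which];
        int taken = (OUTCOMES[i] == 'T');
        int predicted_taken = (*c >= 2u);

        if (predicted_taken == taken)
            correct++;
        if (taken) {
            if (*c < 3u) (*c)++;
        } else {
            if (*c > 0u) (*c)--;
        }
    }
    return correct;
}
```

Calling `count_correct(0)` yields 3 and `count_correct(1)` yields 13, that is, 10 additional correct predictions out of 20 executions.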







[Fig. 6: four independent 2-bit counter FSMs, one for each of
path 0-0, path 0-1, path 1-0, and path 1-1.]
Fig. 6 FSMs using Four 2-bit Counters

Fig. 6 suggests that in order to select the proper 2-bit counter
assigned to each subhistory for prediction, one needs to memorize the
branch path leading to b3. This can be achieved by using a 2-bit shift
register which records the outcomes of the two most recently
executed branches. The shift register is then used to select the
appropriate counter. The use of a shift register for tracking and
selectively relating the correlated information to the proper branch
subhistory is the main idea of the proposed correlation-based branch
prediction. Basically, the proposed scheme uses the branch path
information to split the history of a branch into several subhistories
and selectively uses the proper subhistory information for predicting
the outcome of the branch.

Generally, an M-step correlation-based branch prediction uses
the outcomes of the last M branches (including unconditional
branches) seen by the machine to split the history of a branch into
$2^M$ subhistories. The prediction is then done independently within
each subhistory using any (or the best) history-based branch prediction
algorithm. A good candidate for prediction within each subhistory is
the N-bit counter-based branch prediction mentioned earlier. In this
case, an M-bit shift register is required to store the outcomes of the
last M branch executions (0 for not taken, 1 for taken). This shift
register is able to identify a total of $2^M$ subhistories of a branch.
Within each subhistory, the prediction is done using an N-bit counter
associated with it. There are a total of $2^M$ FSMs associated with
each branch. Every time the outcome of a branch is to be predicted, the
M-bit shift register is used to select the proper FSM, resulting in a
set of N prediction bits. Once the FSM is selected, the prediction and
the state update are done according to the N-bit counter-based
prediction algorithm.

In the following, we will refer to this scheme as the (M,N)
correlation-based branch prediction scheme or simply the (M,N)
correlation scheme, meaning that an M-bit shift register is used to
select an N-bit counter for prediction. The number of correlation steps
is defined as the number of bits in the shift register. When the
prediction scheme used within each subhistory is understandable without
any ambiguity, we will simply refer to it as an M-step correlation
scheme.
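
As a rough illustration of the (M,N) correlation scheme, the sketch below keeps $2^M$ N-bit counters per table entry and uses an M-bit shift register of recent branch outcomes to pick one of them. The sizes `M_STEPS`, `N_BITS`, and `ENTRY_BITS`, the type `corr_predictor_t`, and the functions `corr_init`, `corr_predict`, and `corr_update` are all hypothetical choices made for this example; indexing by the low-order branch address bits is likewise an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes: a (2,2) scheme with 16 table entries. */
enum { M_STEPS = 2, N_BITS = 2, ENTRY_BITS = 4 };

typedef struct {
    uint8_t  counters[1u << ENTRY_BITS][1u << M_STEPS]; /* 2^M counters per entry */
    unsigned history;                                    /* M-bit shift register */
} corr_predictor_t;

/* All counters and the shift register start at 0. */
static void corr_init(corr_predictor_t *p)
{
    memset(p, 0, sizeof *p);
}

/* Entry chosen by low-order address bits, counter chosen by the history. */
static uint8_t *corr_select(corr_predictor_t *p, uint32_t branch_addr)
{
    unsigned entry = branch_addr & ((1u << ENTRY_BITS) - 1u);
    assert(p->history < (1u << M_STEPS));
    return &p->counters[entry][p->history];
}

/* Predict taken when the selected counter is at or above 2^(N-1). */
static bool corr_predict(corr_predictor_t *p, uint32_t branch_addr)
{
    return *corr_select(p, branch_addr) >= (1u << (N_BITS - 1));
}

/* Update the selected counter, then shift the outcome into the history
 * (0 for not taken, 1 for taken). */
static void corr_update(corr_predictor_t *p, uint32_t branch_addr, bool taken)
{
    uint8_t *c = corr_select(p, branch_addr);
    if (taken) {
        if (*c < (1u << N_BITS) - 1u) (*c)++;
    } else {
        if (*c > 0u) (*c)--;
    }
    p->history = ((p->history << 1) | (taken ? 1u : 0u))
                 & ((1u << M_STEPS) - 1u);
}
```

In use, `corr_predict` is called when a branch is fetched and `corr_update` when it resolves. A branch whose outcome simply repeats that of the preceding branch alternates from the point of view of a single counter, but each of its subhistories is constant, so the selected counters settle quickly.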

2.3 Implementation 

When the N-bit counter scheme or the (M,N) correlation scheme
is implemented by itself, a table is required to store the prediction
information. We refer to this table as the "branch prediction table" or
briefly, BPT. Fig. 7 (a) shows the logical organization of a 1K-entry
BPT for the 2-bit counter scheme, with each entry containing 2
prediction bits. Fig. 7 (b) shows the logical organization of a
1K-entry BPT for the (2,2) correlation scheme, with each entry
containing $2 \times 2^2 = 8$ prediction bits.

Notice the difference in physical size of the two tables, even
though the number of logical entries is identical. In general, if a
$2^I$-entry table is used for the (M,N) correlation scheme, a total of
$N \times 2^{I+M}$ bits is required for the table, with each entry
containing $2^M$ sets of N prediction bits. The table is generally
accessed using the low-order I bits of the branch address. However,
depending on the implementation, the table may be accessed using the
address of the instruction immediately prior to the branch under
consideration [11]. Once the entry is determined, the M-bit shift
register which stores the outcomes of the last M branches is used to
select the proper set of the N bits from the entry. These N bits are
used for predicting the outcomes of all branches whose addresses are
mapped into the same entry.
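
The storage formula $N \times 2^{I+M}$ is easy to check numerically. The helper below is a sketch with a hypothetical name, `bpt_bits`, that evaluates it for a table with $2^I$ entries.

```c
#include <assert.h>
#include <stdint.h>

/* Total table storage in bits for an (M,N) scheme with 2^I entries:
 * each entry holds 2^M sets of N prediction bits, so N * 2^(I+M). */
static uint64_t bpt_bits(unsigned i_bits, unsigned m_steps, unsigned n_bits)
{
    assert(i_bits + m_steps < 48u);   /* keep the shift well inside 64 bits */
    return (uint64_t)n_bits << (i_bits + m_steps);
}
```

For the two 1K-entry tables discussed above, `bpt_bits(10, 0, 2)` gives 2048 bits for the 2-bit counter scheme and `bpt_bits(10, 2, 2)` gives 8192 bits (1KB) for the (2,2) correlation scheme.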



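[Fig. 7: (a) 2-bit counter scheme: 10 bits of the branch address select
one of 1K entries, each holding 2 prediction bits. (b) (2,2) correlation
scheme: 10 bits of the branch address select one of 1K entries, and a
2-bit shift register selects one set of 2 prediction bits within the
entry.]
Fig. 7 Logical Organization of a 1K-Entry Table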

A design tradeoff in implementing dynamic branch prediction
usually involves choosing the physical size of the BPT for a
desired prediction accuracy. It is interesting to note from Fig. 7 that
if the BPT size is to be changed, two "logical directions" can be
considered. The table size can be increased/decreased either along the
vertical direction as shown in Fig. 8 (a) or along the horizontal
direction as shown in Fig. 8 (b). Fig. 8 (a) is typical for implementing
the counter-based scheme whereas Fig. 8 (b) is typical for correlation
schemes. We will refer to the directions shown in Fig. 8 (a) and (b)
as the entry-dimension and the correlation-dimension, respectively.
Of course, any combination of the two is possible.

[Fig. 8: (a) table increased along the entry-dimension;
(b) table increased along the correlation-dimension.]
Fig. 8 Increasing the Size of a BPT

While the logical organization and the behavior of the tables for 
the counter and the correlation schemes are different, the physical 
implementations are quite similar. Fig. 9 shows the implementation 
using a 1KB-BPT. When this table is used for the 2-bit counter 
scheme, 12 bits are required for a table lookup (Fig. 9 (a)). As
mentioned earlier, these 12 bits are usually obtained from the branch
address. However, if the same table is used for correlation schemes,
some of the bits for table lookup are obtained from the shift-register. 
For example, if the (8,2) correlation scheme is implemented as 
shown in Fig. 8 (b), the bits for table lookup consist of 8 bits from 
the shift register and 4 bits from the branch address. It is important 
to note that as a correlation scheme is implemented instead of the 
original 2-bit counter scheme using the same size of table, the only 
extra hardware cost incurred by the correlation scheme is the shift 
register (Fig. 9 (b)). 
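
The composition of the 12 lookup bits can be sketched as follows. This is only an illustration: `INDEX_BITS` and `bpt_index` are hypothetical names, and placing the shift register bits in the low-order positions of the index is an assumption of this sketch, since the text does not specify the bit ordering.

```c
#include <assert.h>
#include <stdint.h>

/* A 1KB table of 2-bit counters holds 2^12 counters, hence 12 index bits. */
enum { INDEX_BITS = 12 };

/* Build the lookup index from the M most recent branch outcomes and the
 * (12 - M) low-order bits of the branch address.
 * Assumption: history bits occupy the low-order end of the index. */
static unsigned bpt_index(uint32_t branch_addr, unsigned history,
                          unsigned m_steps)
{
    unsigned addr_bits, addr_part, hist_part;

    assert(m_steps <= (unsigned)INDEX_BITS);
    addr_bits = (unsigned)INDEX_BITS - m_steps;
    addr_part = branch_addr & ((1u << addr_bits) - 1u);
    hist_part = history & ((1u << m_steps) - 1u);
    return (addr_part << m_steps) | hist_part;
}
```

With `m_steps` equal to 0 the index is taken entirely from the branch address, as in the 2-bit counter scheme; with `m_steps` equal to 8 it combines 8 history bits with 4 address bits; with `m_steps` equal to 12 the address no longer influences the lookup at all.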



[Fig. 9: three ways of addressing the same 1KB table with 12 lookup
bits; in each case the table outputs the prediction bits and the new
state is written back. (a) 2-bit counter scheme: all 12 bits come from
the branch address. (b) (8,2) correlation scheme: 8 bits come from an
8-bit shift register and 4 bits from the branch address. (c) (12,2)
correlation scheme, degenerate: all 12 bits come from a 12-bit shift
register.]
Fig. 9 Physical Implementation Using a 1KB-BPT



Fig. 9 (b) also shows an interesting case: when the table size is
fixed, the larger the shift register used, the fewer branch address bits
are required. In other words, when the table size is fixed, increasing
the size of the shift register is equivalent to "squashing" the BPT
along the entry-dimension. An interesting extreme case occurs when the
table degenerates to a single-entry table. In this case, the bits for
table lookup are obtained entirely from the shift register. Fig. 9 (c)
shows the degenerate case for a 1KB-BPT. This case is equivalent to
implementing the (12,2) correlation scheme using a single-entry table
shown in Fig. 10. Similarly, Fig. 9 (a) can be thought of as the other
extreme case when the table in Fig. 9 (b) is "squashed" along the
correlation-dimension. The advantage of considering the degenerate case
is that its table lookup depends only on the shift register, completely
independent of the branch address. Because of this unique
characteristic, a resolved branch always predicts the outcome of the
next branch. The degenerate case of the correlation scheme is
interesting, not only because of its simple implementation, but also
because the predicted outcome of a branch can be known well before the
execution of that branch.

[Fig. 10: a 12-bit shift register selects one of the $2^{12}$ sets
(0 to $2^{12}-1$) in a single-entry table, each set holding 2
prediction bits.]
Fig. 10 Degenerate Case for a 1KB-BPT

3. Trace-Driven Simulations & Results

Trace-driven simulations are used to examine the (M,2) correlation
schemes for BPTs with entries ranging from 1 to 32K. Due to
limitations of program size and simulation time, only M ≤ 10
is evaluated for non-degenerate cases and M ≤ 15 for degenerate
cases. Note that the scheme (0,2) corresponds to the original 2-bit
counter scheme. The programs used for the experiment are from the
SPEC benchmark suite. The traces are collected using a trace program
and commercially available C and FORTRAN compilers for
the IBM RISC System/6000 system. Table 2 summarizes the trace
lengths and branch statistics for the benchmarks used in this study.
The accuracy, defined as the percentage of correct predictions, will
be used as the metric for measuring the efficiency of branch
prediction.

For SPEC floating-point benchmarks nasa7, matrix300, and
tomcatv, no difference is found between correlation-based schemes
and the 2-bit counter scheme. All predict with more than 99% accuracy.
These results are not surprising for loop-intensive
scientific/engineering applications where programming structures are
dominated by simple loops. Because of this, only the results of the
other 7 SPEC benchmarks, namely, doduc, spice, fpppp, gcc, espresso,
eqntott, and li, are presented. For convenience, we will use the
shorthand "7 SPEC benchmarks" or "7 benchmarks" to mean these 7
benchmarks, "floating-point benchmarks" to mean the benchmarks
doduc, spice, and fpppp, and "integer benchmarks" to mean the
benchmarks gcc, espresso, eqntott, and li.

Table 2 Branch Statistics for SPEC Benchmarks
b_u: frequency of unconditional branches
b_c: frequency of conditional branches
p: probability that a branch is taken
q: probability that a conditional branch is taken
s: static conditional branches per 1 million executed
conditional branches

3.1 Accuracy for Fixed Table Size 

We first compare the correlation-based scheme with the 2-bit
counter scheme using the same 1KB-BPT. Notice that the number
of table entries for the two schemes is different (see Fig. 7). A 1KB
table has 4K entries when the 2-bit counter scheme is implemented,
whereas the same table has only 16 entries when the (8,2) correlation
scheme is implemented.

Fig. 11 shows the results for a 1KB-BPT. The figure compares 
the accuracy obtained by implementing the 2-bit counter scheme and 
the additional accuracy gained by implementing the (8, 2) correlation 
scheme. Since the 2-bit counter scheme has already provided very 
high accuracies for doduc and espresso (about 95%), there is very 
little chance for correlation schemes to gain more accuracy. The 
benchmark gcc shows very little improvement in accuracy. This is
because a 1KB table is not large enough to contain most of the
frequently executed branches in gcc.

The remaining benchmarks show considerable improvements in
accuracy. The two biggest gains in accuracy are obtained by eqntott
and li. Since branches in eqntott are highly correlated, the 2-bit
counter scheme cannot provide high accuracy (only about 83%).
More than 11% of additional accuracy can be attained by the
correlation scheme.

The second highest improvement in accuracy is achieved by li
(more than 5%). It is known that li is a "pointer-chasing" oriented
program where a compiler may generate load, compare, and branch
instructions in sequence over and over again. Branch correlation
exists wherever the data loaded for determining the branch direction
is affected by the directions of prior branches. As reported in [2], a
compare-branch pair of instructions in the IBM RISC System/6000
machine causes a 3-cycle bubble in the pipeline. The correlation-based
scheme proposed here is particularly useful for reducing such
delay.

Although we have only shown the results for the 8-step correla- 
tion scheme, it is observed from the simulation that, as the number 
of table entries is fixed, the accuracy increases as the number of cor- 
relation steps increases. This observation is true for all the 7 bench- 
marks. 
[Fig. 11: bar chart for doduc, spice, fpppp, gcc, espresso, eqntott,
and li, showing the accuracy for the 2-bit counter scheme and the
additional accuracy gained by implementing the (8,2) correlation scheme
with the same table.]
Fig. 11 Accuracies for a 1KB-BPT: (0,2) vs. (8,2)

3.2 Accuracy at the Limiting Case 

It is observed that the accuracy provided by the 2-bit counter 
scheme asymptotically approaches a certain limit as the BPT size
increases. Fig. 12 shows the limit at which the 2-bit counter scheme
saturates. When the table is large enough to contain most of the fre- 
quently executed branches, the prediction capability of the 2-bit 
counter scheme reaches its inherent limits. As we mentioned earlier, 
one of the limitations of the 2-bit counter scheme is that it is self-his- 
tory based. Since the correlation scheme provides better prediction 
by incorporating the information from other branches, it can surpass 
the limit at which the 2-bit counter scheme saturates. 

As an illustration, consider the accuracy curves for li shown in
Fig. 13. It is clear that the accuracy provided by the 2-bit counter
scheme saturates at a table of 2K entries. Increasing the table size
along the entry-dimension as shown in the figure makes very little
improvement in accuracy. However, if the BPT size is increased
along the correlation-dimension (see Fig. 8 (b)), more accuracy can
be gained. Fig. 12 shows the additional accuracy achievable by the
correlation scheme for the 7 benchmarks.
[Fig. 12: limiting-case accuracy of the 2-bit counter scheme and the
additional accuracy achievable by the correlation scheme.]
Fig. 12 Limiting Case Accuracy
[Fig. 13: accuracy versus log2(# of table entries) for the 2-bit
counter scheme, the (5,2) correlation scheme, and the (10,2)
correlation scheme.]
Fig. 13 Prediction Accuracy for li

3.3 Accuracy at the Degenerate Case 

The degenerate correlation scheme provides an interesting case 
for a practical implementation, since its table lookup doesn't depend 
on the branch address. Because of this unique characteristic, the 
table lookup for the next branch can be done as soon as the current 
branch is resolved. This is attractive to timing-critical implementa- 
tions of the branch prediction. 

The only disadvantage with the degenerate case is that the table
must be very large in order to outperform the 2-bit counter scheme.
This is due to the fact that an enormous number of address conflicts is
introduced with a one-entry table (Fig. 10). However, the effect of
address conflict is attenuated when the table size is large. It is
observed from the simulation that a larger correlation step is required
before the degenerate case has a noticeable improvement over the
2-bit counter scheme. Table 3 summarizes the observation.

It is also observed that when the table size is large, the degenerate 
case sometimes performs better than the non-degenerate case. Fig. 
14 shows the results of implementing the degenerate (15,2) scheme 
using an 8KB-table. 



Table 3 Number of Correlation Steps Required Before the Degenerate
Case Has Noticeable Improvement Over the 2-Bit Counter Scheme

Fig. 14 Accuracy for an 8KB-BPT: (0,2) vs. Degenerate (15,2)



4. Conclusions 

In this paper, we have proposed a novel dynamic branch predic- 
tion scheme which uses the proper subhistory information of a 
branch to predict the outcome of that branch. The key idea is to relate 
the subhistory which is being selected to the most recently executed 
branches via a shift register. The new scheme is evaluated using 
traces collected from running the SPEC benchmark suite on an IBM 
RISC System/6000 machine. It is shown that the proposed new 
scheme gives considerably higher accuracy than that of the 2-bit 
counter prediction scheme at the extra hardware cost of one shift reg- 
ister. We have observed from the simulation that for the same BPT 
of size 1KB or above, the (M,2) correlation scheme generally pro- 
vides the best improvement in accuracy over the 2-bit counter 
scheme for 5<M<8. We want to emphasize that as more
instruction-level parallelism is exploited by today's superscalar and
superpipelined processors, an increase of even a few percent in branch
prediction accuracy is significant in improving the overall processor
performance.

We have demonstrated that the new scheme is simple and easy 
to implement. It provides a new dimension as a design alternative 
for increasing the BPT size, i.e., the correlation-dimension. We have 
also shown that the accuracy of the correlation scheme surpasses that 
of the 2-bit counter scheme at saturation. 
Appendix

Examples of Source Code-Level Branch Correlation from 
the SPEC Integer Benchmarks: 



benchmark: eqntott    filename: pterm_ops.c

    if (aa == 2)
        aa = 0;
    if (bb == 2)
        bb = 0;
    if (aa != bb) {



benchmark: eqntott    filename: pterm_ops.c

    while (low <= high) {
        i = (high + low) / 2;
        if (H(i) < hsh)
            low = i + 1;
        else if (i > 0 && H(i-1) >= hsh)
            high = i - 1;
        else if (H(i) == hsh)
            break;
        else return (NIL_PTERM);



benchmark: eqntott filename: ucbqsort.c



j = (j == jj ? i : jj);

if ((*qcmp)(j, tmp) < 0)
    j = tmp;



benchmark: li filename: xllist.c



while (*adstr && consp(list))
    list = (*adstr++ == 'a' ? car(list) : cdr(list));



benchmark: li filename: xlread.c 



while ((ch = xlpeek(fptr)) != EOF) {
    if (islower(ch)) ch = toupper(ch);
    if (!isdigit(ch) && !(ch >= 'A' && ch <= 'F'))
        break;
benchmark: li filename: xlmath.c



if (imode)
    switch (fcn) {
    case '<': icmp = (icmp < 0); break;
    case 'L': icmp = (icmp <= 0); break;
    case '=': icmp = (icmp == 0); break;
    case '#': icmp = (icmp != 0); break;
    case 'G': icmp = (icmp >= 0); break;
    case '>': icmp = (icmp > 0); break;
    }
else
    switch (fcn) {
    case '<': icmp = (fcmp < 0.0); break;
    case 'L': icmp = (fcmp <= 0.0); break;
    case '=': icmp = (fcmp == 0.0); break;
    case '#': icmp = (fcmp != 0.0); break;
    case 'G': icmp = (fcmp >= 0.0); break;
    case '>': icmp = (fcmp > 0.0); break;
    }

return (icmp ? true : NIL);



benchmark: li filename: xlcont.c 



rbreak = FALSE;
while (xleval(test) == NIL) {
    if (tagblock(arg, &rval)) {
        rbreak = TRUE;
        break;
    }
}

if (!rbreak)



benchmark: espresso filename: compl.c 



for (pl = *L1, pr = *R1; (pl != NULL) && (pr != NULL); )
    switch (d1_order(L1, R1)) {
    case 1:
        pr = *(++R1); break;
    case -1:
        pl = *(++L1); break;
    case 0:
        RESET(pr, ACTIVE);
        INLINEset_or(pl, pl, pr);
        pr = *(++R1);



benchmark: gcc filename: reload.c



if (in != 0)
    class = PREFERRED_RELOAD_CLASS (in, class);
if (class == NO_REGS)



benchmark: gcc filename: cse.c



if (elt != 0 && elt->related_value != 0)
    relt = elt;
else if (elt == 0 && GET_CODE (x) == CONST)
{
    rtx subexp = get_related_value (x);
    if (subexp != 0)
        relt = lookup (subexp,
                       safe_hash (subexp, GET_MODE (subexp)) % NBUCKETS,
                       GET_MODE (subexp));
}

if (relt == 0)
    return 0;



benchmark: gcc filename: flow.c



for (j = XVECLEN (x, i) - 1; j >= 0; j--)
{



    if (value == 0)
        value = tem;



}



benchmark: gcc filename: flow.c



while (INSN_DELETED_P (first))
    first = NEXT_INSN (first);
while (prev != first)
{
    prev = PREV_INSN (prev);
    PUT_CODE (prev, NOTE);
    NOTE_LINE_NUMBER (prev) = NOTE_INSN_DELETED;
    NOTE_SOURCE_FILE (prev) = 0;
}



benchmark: gcc filename: cse.c



if (tem != 0)
    y0 = tem;

if (y0 == 0)
    return 0;



benchmark: gcc filename: cse.c



switch (i)
{
case 0:
    const_arg0 = const_arg;
    break;
case 1:
    const_arg1 = const_arg;
    break;
case 2:
    const_arg2 = const_arg;
    break;
}



switch (code)
{



case EQ:
    if (const_arg0 && const_arg0 == XEXP (x, 0)
        && (! (const_arg1 && const_arg1 == XEXP (x, 1))
            || (GET_CODE (const_arg0) == CONST_INT
                && GET_CODE (const_arg1) != CONST_INT)))
Abstract

As the issue rate and depth of pipelining of high perfor- 
mance Superscalar processors increase, the importance 
of an excellent branch predictor becomes more vital to 
delivering the potential performance of a wide-issue, 
deep pipelined microarchitecture. We propose a new 
dynamic branch predictor (Two-Level Adaptive Branch 
Prediction) that achieves substantially higher accuracy 
than any other scheme reported in the literature. The 
mechanism uses two levels of branch history information
to make predictions: the history of the last k branches
encountered, and the branch behavior for the last s
occurrences of the specific pattern of these k branches. We
have identified three variations of the Two-Level Adaptive
Branch Prediction, depending on how finely we resolve
the history information gathered. We compute the
hardware costs of implementing each of the three variations,
and use these costs in evaluating their relative effectiveness.
We measure the branch prediction accuracy
of the three variations of Two-Level Adaptive Branch
Prediction, along with several other popular proposed
dynamic and static prediction schemes, on the SPEC 
benchmarks. We show that the average prediction ac- 
curacy for Two-Level Adaptive Branch Prediction is 97 
percent, while the other known schemes achieve at most 
94.4 percent average prediction accuracy. We measure 
the effectiveness of different prediction algorithms and 
different amounts of history and pattern information. 
We measure the costs of each variation to obtain the 
same prediction accuracy. 

1 Introduction 

As the issue rate and depth of pipelining of high per- 
formance Superscalar processors increase, the amount 
of speculative work due to branch prediction becomes 
much larger. Since all such work must be thrown away 
if the prediction is incorrect, an excellent branch pre- 
dictor is vital to delivering the potential performance of 
a wide-issue, deep pipelined microarchitecture. Even a 
prediction miss rate of 5 percent results in a substantial
loss in performance due to the number of instructions 
fetched each cycle and the number of cycles these in- 
structions are in the pipeline before an incorrect branch 
prediction becomes known. 

The literature is full of suggested branch prediction 
schemes [6, 13, 14, 17]. Some are static in that they use 
opcode information and profiling statistics to make pre- 
dictions. Others are dynamic in that they use run-time 
execution history to make predictions. Static schemes 
can be as simple as always predicting that the branch 
will be taken, or can be based on the opcode, or on the 
direction of the branch, as in "if the branch is backward, 
predict taken, if forward, predict not taken" [17]. This 
latter scheme is effective for loop intensive code, but 
does not work well for programs where the branch be- 
havior is irregular. Also, profiling [6, 13] can be used to 
predict branches by measuring the tendency of a branch 
on sample data sets and presetting a static prediction 
bit in the opcode according to that tendency. Unfortunately,
branch behavior for the sample data may be
very different from the data that appears at run-time.

Dynamic branch prediction can be as simple as keeping
track only of the last execution of a branch
instruction and predicting that the branch will behave the
same way, or as elaborate as maintaining
very large amounts of history information. In all cases,
the fact that the dynamic prediction is being made on 
the basis of run-time history information implies that 
substantial additional hardware is required. J. Smith 
[17] proposed utilizing a branch target buffer to store, 
for each branch, a two-bit saturating up-down counter 
which collects and subsequently bases its prediction on 
branch history information about that branch. Lee and 
A. Smith proposed [14] a Static Training method which 
uses statistics gathered prior to execution time coupled 
with the history pattern of the last k run-time execu- 
tions of the branch to make the next prediction as to 
which way that branch will go. The major disadvantage 
of Static Training methods has been mentioned above 
with respect to profiling; the pattern history statistics 
gathered for the sample data set may not be applicable 
to the data that appears at run-time. 

In this paper we propose a new dynamic branch pre- 
dictor that achieves substantially higher accuracy than 
any other scheme reported in the literature. The mech- 
anism uses two levels of branch history information to 
make predictions. The first level is the history of the 
last k branches encountered. (Variations of our scheme
reflect whether this means the actual last k branches en- 
countered, or the last k occurrences of the same branch 
instruction.) The second level is the branch behavior 
for the last s occurrences of the specific pattern of these 
k branches. Prediction is based on the branch behavior 
for the last s occurrences of the pattern in question. 

For example, suppose, for k = 8, the last k branches
had the behavior 11100101 (where 1 represents that the
branch was taken, 0 that the branch was not taken).
Suppose further that s = 6, and that in each of the last
six times the previous eight branches had the pattern
11100101, the branch alternated between taken and not
taken. Then the second level would contain the history
101010. Our branch predictor would predict "taken."

The history information for level 1 and the pattern
information for level 2 are collected at run time, eliminating
the above mentioned disadvantages of the Static
Training method. We call our method Two-Level Adaptive
Branch Prediction. We have identified three variations
of Two-Level Adaptive Branch Prediction, depending
on how finely we resolve the history information
gathered. We compute the hardware costs of implementing
each of the three variations, and use these
costs in evaluating their relative effectiveness.

Using trace-driven simulation of nine of the ten SPEC 
benchmarks we measure the branch prediction ac- 
curacy of the three variations of Two-Level Adaptive 
Branch Prediction, along with several other popular 
proposed dynamic and static prediction schemes. We 
measure the effectiveness of different prediction algo- 
rithms and different amounts of history and pattern 
information. We measure the costs of each variation 
to obtain the same prediction accuracy. Finally we 
compare the Two-Level Adaptive branch predictors to 
the several popular schemes available in the literature. 
We show that the average prediction accuracy for Two- 
Level Adaptive Branch Prediction is about 97 percent, 
while the other schemes achieve at most 94.4 percent 
average prediction accuracy. 

This paper is organized in six sections. Section two 
introduces our Two-Level Adaptive Branch Prediction 
and its three variations. Section three describes the cor- 
responding implementations and computes the associ- 
ated hardware costs. Section four discusses the Simula- 
tion model and traces used in this study. Section five 
reports the simulation results and our analysis. Section 
six contains some concluding remarks. 

2 Definition of Two-Level Adaptive Branch
Prediction

2.1 Overview 

Two-Level Adaptive Branch Prediction uses two levels
of branch history information to make predictions. The
first level is the history of the last k branches encountered.
(Variations of our scheme reflect whether this
means the actual last k branches encountered, or the
last k occurrences of the same branch instruction.) The
second level is the branch behavior for the last s occurrences
of the specific pattern of these k branches.
Prediction is based on the branch behavior for the last
s occurrences of the pattern in question.

1 The Nasa7 benchmark was not simulated because this benchmark
consists of seven independent loops. It takes too long to
simulate the branch behavior of these seven kernels, so we omitted
these loops.

To maintain the two levels of information, Two-Level 
Adaptive Branch Prediction uses two major data struc- 
tures, the branch history register (HR) and the pattern 
history table (PHT), see Figure 1. Instead of accumu- 
lating statistics by profiling programs, the information 
on which branch predictions are based is collected at 
run-time by updating the contents of the history regis- 
ters and the pattern history bits in the entries of the 
pattern history table depending on the outcomes of the 
branches. The history register is a k-bit shift register
which shifts in bits representing the branch results of
the most recent k branches.

Figure 1: Structure of Two-Level Adaptive Branch Prediction.
(Diagram labels: Branch History Register, Branch History
Pattern, Pattern History Table (PHT).)

If the branch was taken, then a "1" is recorded; if
not, a "0" is recorded. Since there are k bits in the
history register, at most $2^k$ different patterns appear in
the history register. For each of these $2^k$ patterns, there
is a corresponding entry in the pattern history table
which contains branch results for the last s times the
preceding k branches were represented by that specific
content of the history register.

When a conditional branch B is being predicted,
the content of its history register, HR, denoted as
$R_{c-k}R_{c-k+1}\ldots R_{c-1}$, is used to address the pattern
history table. The pattern history bits $S_c$ in the addressed
entry $PHT_{R_{c-k}R_{c-k+1}\ldots R_{c-1}}$ in the pattern history
table are then used for predicting the branch. The
prediction of the branch is

$z_c = \lambda(S_c)$, (1)

where $\lambda$ is the prediction decision function.

After the conditional branch is resolved, the outcome
$R_c$ is shifted left into the history register HR
in the least significant bit position and is also used
to update the pattern history bits in the pattern history
table entry $PHT_{R_{c-k}R_{c-k+1}\ldots R_{c-1}}$. After being
updated, the content of the history register becomes
$R_{c-k+1}R_{c-k+2}\ldots R_c$ and the state represented by the
pattern history bits becomes $S_{c+1}$. The transition of the
pattern history bits in the pattern history table entry
is done by the state transition function $\delta$ which takes
in the old pattern history bits and the outcome of the
branch as inputs to generate the new pattern history
bits. Therefore, the new pattern history bits $S_{c+1}$ become

$S_{c+1} = \delta(S_c, R_c)$. (2)

A straightforward combinational logic circuit is used to
implement the function $\delta$ to update the pattern history
bits in the entries of the pattern history table. The transition
function $\delta$, predicting function $\lambda$, pattern history
bits S and the outcome R of the branch comprise a
finite-state Moore machine, characterized by equations
1 and 2.
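
As an illustration of equations 1 and 2, the following C sketch models a single history register that indexes one pattern history table. It is a sketch under stated assumptions rather than the implementation evaluated in this paper: the history length `K`, the zero initial state, the choice of a 2-bit saturating counter as the pattern history state, and all identifiers (`hr_predictor`, `lambda_fn`, `delta_fn`) are hypothetical. `lambda_fn` plays the role of the prediction decision function and `delta_fn` the role of the state transition function.

```c
#include <assert.h>
#include <stdint.h>

#define K 4                       /* history register length k (assumed) */
#define PHT_ENTRIES (1u << K)     /* one PHT entry per history pattern */

typedef struct {
    uint32_t hr;                  /* k-bit history, newest outcome in bit 0 */
    uint8_t  pht[PHT_ENTRIES];    /* pattern history bits S (here 0..3) */
} hr_predictor;

static void hr_init(hr_predictor *p)
{
    p->hr = 0;                    /* assumed initial state: all zero */
    for (unsigned i = 0; i < PHT_ENTRIES; i++) p->pht[i] = 0;
}

/* Prediction decision function, equation (1): z = lambda(S). */
static int lambda_fn(uint8_t s)
{
    return s >= 2;
}

/* State transition function, equation (2): S' = delta(S, R). */
static uint8_t delta_fn(uint8_t s, int taken)
{
    assert(s <= 3);
    if (taken) return (uint8_t)(s < 3 ? s + 1 : 3);
    return (uint8_t)(s > 0 ? s - 1 : 0);
}

/* The history register content addresses the PHT. */
static int hr_predict(const hr_predictor *p)
{
    return lambda_fn(p->pht[p->hr]);
}

/* After the branch resolves: update the addressed entry, then shift the
 * outcome into the least significant bit of the history register. */
static void hr_update(hr_predictor *p, int taken)
{
    uint32_t idx = p->hr;
    p->pht[idx] = delta_fn(p->pht[idx], taken);
    p->hr = ((p->hr << 1) | (taken ? 1u : 0u)) & (PHT_ENTRIES - 1u);
}
```

A caller would invoke `hr_predict` when the branch is fetched and `hr_update` once its outcome is known; a strictly alternating branch, for instance, is predicted correctly once the two PHT entries it touches have been trained.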

State diagrams of the finite-state Moore machines 
used in this study for updating the pattern history in 
the pattern history table entry and for predicting which 
path the branch will take are shown in Figure 2. The 
automaton Last-Time stores in the pattern history only
the outcome of the last execution of the branch when
the history pattern appeared. The next time the same
history pattern appears, the prediction will be what happened
last time. Only one bit is needed to store that
pattern history information. The automaton A1 records
the results of the last two times the same history pattern
appeared. The next execution of the branch with the same
history pattern in the history register is predicted as not
taken only when no taken branch is recorded; otherwise,
the branch is predicted as taken. The automaton A2 is a
saturating up-down counter, similar to the automaton used in J.
Smith's branch target buffer design for keeping branch
history [17].




Figure 2: State diagrams of the finite-state Moore ma- 
chines used for making prediction and updating the pat- 
tern history table entry. 

In J. Smith's design the 2-bit saturating up-down 
counter keeps track of the branch history of a certain 
branch. The counter is incremented when the branch
is taken and is decremented when the branch is not
taken. The branch path of the next execution of the 
branch will be predicted as taken when the counter value 
is greater than or equal to two; otherwise, the branch 
will be predicted as not taken. In Two-Level Adap- 
tive Branch Prediction, the 2-bit saturating up-down 
counter keeps track of the history of a certain history 
pattern. The counter is incremented when the result of 
a branch, whose history register content is the same as 
the pattern history table entry index, is taken; other- 
wise, the counter is decremented. The next time the 
branch has the same history register content which ac- 
cesses the same pattern history table entry, the branch is 
predicted taken if the counter value is greater or equal 
to two; otherwise, the branch is predicted not taken. 
Automata A3 and A4 are variations of A2.
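
The three automata described in the text can be written as small transition and prediction functions. The sketch below is illustrative only: the state encodings (one bit for Last-Time, a 2-bit shift register of the last two outcomes for A1, a counter value 0..3 for A2) and the names `fsm_delta` and `fsm_lambda` are assumptions. A3 and A4 are not sketched because their state diagrams are given only in Figure 2.

```c
#include <assert.h>

typedef enum { LAST_TIME, A1, A2 } automaton;

/* State transition: new pattern history bits from old bits and outcome. */
static unsigned fsm_delta(automaton a, unsigned s, int taken)
{
    unsigned r = taken ? 1u : 0u;
    switch (a) {
    case LAST_TIME:               /* remember only the last outcome */
        return r;
    case A1:                      /* remember the last two outcomes */
        return ((s << 1) | r) & 3u;
    case A2:                      /* 2-bit saturating up-down counter */
        if (r) return s < 3u ? s + 1u : 3u;
        return s > 0u ? s - 1u : 0u;
    }
    assert(0 && "unknown automaton");
    return 0u;
}

/* Prediction decision: 1 = predict taken, 0 = predict not taken. */
static int fsm_lambda(automaton a, unsigned s)
{
    switch (a) {
    case LAST_TIME:               /* predict what happened last time */
        return s != 0u;
    case A1:                      /* not taken only if no taken recorded */
        return s != 0u;
    case A2:                      /* taken if counter is two or more */
        return s >= 2u;
    }
    assert(0 && "unknown automaton");
    return 0;
}
```

The difference between A1 and A2 shows up after a run of taken outcomes: A1 needs two consecutive not-taken results before it predicts not taken, and A2 started from its saturated state needs two decrements to drop below the threshold.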

Both Static Training [14] and Two-Level Adaptive
Branch Prediction are dynamic branch predictors, because
their predictions are based on run-time information,
i.e. the dynamic branch history. The major difference
between these two schemes is that the pattern
history information in the pattern history table changes
dynamically in Two-Level Adaptive Branch Prediction
but is preset in Static Training from profiling. In Static
Training, the input to the prediction decision function,
$\lambda$, for a given branch history pattern is known before
execution. Therefore, the output of $\lambda$ is determined before
execution for a given branch history pattern. That
is, the same branch predictions are made if the same
history pattern appears at different times during execution.
Two-Level Adaptive Branch Prediction, on the
other hand, updates the pattern history information
kept in the pattern history table with the actual results
of branches. As a result, given the same branch history
pattern, different pattern history information can
be found in the pattern history table; therefore, there
can be different inputs to the prediction decision function
for Two-Level Adaptive Branch Prediction. Predictions
of Two-Level Adaptive Branch Prediction change
adaptively as the program executes.

Since the pattern history bits change in Two-Level 
Adaptive Branch Prediction, the predictor can adjust to 
the current branch execution behavior of the program to 
make proper predictions. With these run-time updates,
Two-Level Adaptive Branch Prediction can be highly 
accurate over many different programs and data sets. 
Static Training, on the contrary, may not predict well 
if changing data sets brings about different execution 
behavior. 

2.2 Alternative Implementations of Two-Level
Adaptive Branch Prediction 

There are three alternative implementations of the Two- 
Level Adaptive Branch Prediction, as shown in Figure 
3. They are differentiated as follows: 

Two-Level Adaptive Branch Prediction Using a 
Global History Register and a Global Pattern 
History Table (GAg) 

In GAg, there is only a single global history register
(GHR) and a single global pattern history table
(GPHT) used by the Two-Level Adaptive Branch Prediction.
All branch predictions are based on the same
global history register and global pattern history table
which are updated after each branch is resolved. This
variation therefore is called Global Two-Level Adaptive
Branch Prediction using a global pattern history table
(GAg).

Figure 3: Global view of three variations of Two-Level
Adaptive Branch Prediction (GAg, PAg, PAp).

Since the outcomes of different branches update the 
same history register and the same pattern history table, 
the information of both branch history and pattern his- 
tory is influenced by results of different branches. The 
prediction for a conditional branch in this scheme is ac- 
tually dependent on the outcomes of other branches. 

Two-Level Adaptive Branch Prediction Using a 
Per-address Branch History Table and a Global 
Pattern History Table (PAg)

In order to reduce the interference in the first level
branch history information, one history register is as- 
sociated with each distinct static conditional branch to 
collect branch history information individually. The his- 
tory registers are contained in a per-address branch his- 
tory table (PBHT) in which each entry is accessible by 
one specific static branch instruction and is accessed by 
branch instruction addresses. Since the branch history 
is kept for each distinct static conditional branch indi- 
vidually and all history registers access the same global 
pattern history table, this variation is called Per-address 
Two-Level Adaptive Branch Prediction using a global 
pattern history table (PAg). 

The execution results of a static conditional branch 
update the branch's own history register and the global 
pattern history table. The prediction for a conditional 
branch is based on the branch's own history and the 
pattern history bits in the global pattern history table 
entry indexed by the content of the branch's history 
register. Since all branches update the same pattern 
history table, the pattern history interference still exists. 



Two-Level Adaptive Branch Prediction Using
Per-address Branch History Table and Per-address
Pattern History Tables (PAp)

In order to completely remove the interference in both
levels, each static branch has its own pattern history table;
the set of these tables is called the per-address pattern
history tables (PPHT). Therefore, a per-address history register
and a per-address pattern history table are associated
with each static conditional branch. All history registers
are grouped in a per-address branch history table.
Since this variation of Two-Level Adaptive Branch Prediction
keeps separate history and pattern information
for each distinct static conditional branch, it is called
Per-address Two-Level Adaptive Branch Prediction using
Per-address pattern history tables (PAp).
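
The difference between PAg and PAp is only in which pattern history table a branch's history register indexes. The C sketch below makes that concrete under several assumptions that are not part of the paper: an ideal table with one history register per branch, branches identified by a small integer `b`, a 2-bit saturating counter in every pattern history entry, zero initial state, and hypothetical names throughout. For comparison, both organizations are kept side by side and updated together.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define K 3                        /* history register length (assumed) */
#define PHT_SIZE (1u << K)
#define NBRANCH 4u                 /* hypothetical number of static branches */

typedef struct {
    uint8_t hr[NBRANCH];               /* per-address history registers */
    uint8_t gpht[PHT_SIZE];            /* PAg: one global PHT */
    uint8_t ppht[NBRANCH][PHT_SIZE];   /* PAp: one PHT per static branch */
} pa_state;

static void pa_init(pa_state *st)
{
    memset(st, 0, sizeof *st);     /* assumed initial state: all zero */
}

static uint8_t counter_update(uint8_t s, int taken)
{
    if (taken) return (uint8_t)(s < 3 ? s + 1 : 3);
    return (uint8_t)(s > 0 ? s - 1 : 0);
}

/* PAg: the branch's own history indexes the shared table. */
static int pag_predict(const pa_state *st, unsigned b)
{
    assert(b < NBRANCH);
    return st->gpht[st->hr[b]] >= 2;
}

/* PAp: the branch's own history indexes the branch's own table. */
static int pap_predict(const pa_state *st, unsigned b)
{
    assert(b < NBRANCH);
    return st->ppht[b][st->hr[b]] >= 2;
}

/* Update both organizations with the resolved outcome of branch b. */
static void pa_update(pa_state *st, unsigned b, int taken)
{
    assert(b < NBRANCH);
    uint8_t h = st->hr[b];
    st->gpht[h] = counter_update(st->gpht[h], taken);
    st->ppht[b][h] = counter_update(st->ppht[b][h], taken);
    st->hr[b] = (uint8_t)(((unsigned)(h << 1) | (taken ? 1u : 0u))
                          & (PHT_SIZE - 1u));
}
```

Two branches whose histories coincide but whose next outcomes differ (for example one repeating taken, not taken and another repeating taken, not taken, not taken) collide in the shared table of PAg, while the per-branch tables of PAp keep them apart.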

3 Implementation Considerations 

3.1 Pipeline Timing of Branch Prediction and 
Information Update 

Two-Level Adaptive Branch Prediction requires two sequential
table accesses to make a prediction. It is difficult
to squeeze the two accesses into one cycle. High
performance requires that prediction be made within 
one cycle from the time the branch address is known. 
To satisfy this requirement, the two sequential accesses 
are performed in two different cycles as follows: When a 
branch result becomes known, the branch's history reg- 
ister is updated. In the same cycle, the pattern history 
table can be accessed for the next prediction with the 
updated history register contents derived by appending 
the result to the old history. The prediction fetched 
from the pattern history table is then stored along with 
the branch's history in the branch history table. The 
pattern history can also be updated at that time. The 
next time that branch is encountered, the prediction is 
available as soon as the branch history table is accessed. 
Therefore, only one cycle latency is incurred from the 
time the branch address is known to the time the pre- 
diction is available. 

Sometimes the previous branch results may not be 
ready before the prediction of a subsequent branch takes 
place. If the obsolete branch history is used for making 
the prediction, the accuracy is degraded. In such a case, 
the predictions of the previous branches can be used to 
update the branch history. Since the prediction accuracy
of Two-Level Adaptive Branch Prediction is very
high, prediction is enhanced by updating the branch history
speculatively. The update timing for the pattern
history table, on the other hand, is not as critical as that 
of the branch history; therefore, its update can be de- 
layed until the branch result is known. With speculative 
updating, when a misprediction occurs, the branch his- 
tory can either be reinitialized or repaired depending on 
the hardware budget available to the branch predictor. 
Also, if two instances of the same static branch occur 
in consecutive cycles, the latency of prediction can be 
reduced for the second branch by using the prediction 
fetched from the pattern history table directly. 
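
One way to picture speculative history updating with repair is the following C sketch. It is an assumption-laden illustration, not the mechanism of this paper: it keeps one checkpoint of the history register per in-flight branch, assumes that younger speculative history is discarded on a misprediction, and uses hypothetical names (`spec_history`, `speculate`, `resolve`). Reinitializing the history instead of repairing it, as the text also allows, would simply replace the repair line.

```c
#include <assert.h>
#include <stdint.h>

#define K 8                                /* history length (assumed) */
#define HR_MASK ((1u << K) - 1u)

typedef struct {
    uint32_t hr;                           /* speculative history register */
} spec_history;

static uint32_t shift_in(uint32_t hr, int taken)
{
    return ((hr << 1) | (taken ? 1u : 0u)) & HR_MASK;
}

/* At prediction time: shift the predicted direction into the history and
 * return the pre-update history as a checkpoint for later repair. */
static uint32_t speculate(spec_history *h, int predicted_taken)
{
    uint32_t checkpoint = h->hr;
    h->hr = shift_in(h->hr, predicted_taken);
    return checkpoint;
}

/* At resolve time: on a misprediction, rebuild the history from the
 * checkpoint with the actual outcome; otherwise leave it untouched. */
static void resolve(spec_history *h, uint32_t checkpoint,
                    int predicted_taken, int actual_taken)
{
    assert(checkpoint <= HR_MASK);
    if (predicted_taken != actual_taken)
        h->hr = shift_in(checkpoint, actual_taken);
}
```

The pattern history table is not touched here, matching the observation in the text that its update can wait until the branch result is known.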

3.2 Target Address Caching 

After the direction of a branch is predicted, there is 
still the possibility of a pipeline bubble due to the time 
it takes to generate the target address. To eliminate 
this bubble, we cache the target addresses of branches.
One extra field is required in each entry of the branch 
history table for doing this. When a branch is predicted 
taken, the target address is used to fetch the following 
instructions; otherwise, the fall-through address is used. 

Caching the target addresses makes prediction in con- 
secutive cycles possible without any delay. This also 
requires the branch history table to be accessed by the 
fetching address of the instruction block rather than by 
the address of the branch in the instruction block being 
fetched because the branch address is not known until 
the instruction block is decoded. If the address hits in 
the branch history table, the prediction of the branch 
in the instruction block can be made before the instruc- 
tions are decoded. If the address misses in the branch 
history table, either there is no branch in the instruction
block fetched in that cycle or the branch history infor- 
mation is not present in the branch history table. In this 
case, the next sequential address is used to fetch new in- 
structions. After the instructions are decoded, if there is 
a branch in the instruction block and if the instruction 
block address missed in the branch history table, static 
branch prediction is used to determine whether or not 
the new instructions fetched from the next sequential 
address should be squashed. 
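
The fetch-address selection described in this subsection reduces to a small decision, sketched below in C. The structure `bht_lookup`, the function name, and the `block_bytes` parameter are hypothetical; the sketch covers only the choice of the next fetch address and leaves out the later static-prediction check that may squash sequentially fetched instructions after a miss.

```c
#include <assert.h>
#include <stdint.h>

/* Result of probing the branch history table with the fetch address. */
typedef struct {
    int      hit;            /* 1 if the fetch address hit in the table */
    int      predict_taken;  /* stored prediction, valid only on a hit */
    uint32_t target;         /* cached target address, valid only on a hit */
} bht_lookup;

/* Choose the address of the next instruction block to fetch. */
static uint32_t next_fetch_address(uint32_t fetch_addr, uint32_t block_bytes,
                                   bht_lookup r)
{
    assert(block_bytes > 0);
    if (r.hit && r.predict_taken)
        return r.target;                 /* predicted taken: use target */
    return fetch_addr + block_bytes;     /* miss or not taken: sequential */
}
```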



3.3 Per-address Branch History Table Imple- 
mentation 

PAg and PAp branch predictors all use per-address 
branch history tables in their structure. It is not fea- 
sible to have a branch history table large enough to 
hold all branches' execution history in real implemen- 
tations. Therefore, a practical approach for the per- 
address branch history table is proposed here. 

The per-address branch history table can be imple- 
mented as a set-associative or direct-mapped cache. A 
fixed number of entries in the table are grouped together 
as a set. Within a set, a Least-Recently-Used (LRU) algorithm
is used for replacement. The lower part of a
branch address is used to index into the table and the 
higher part is stored as a tag in the entry associated 
with that branch. When a conditional branch is to be 
predicted, the branch's entry in the branch history ta- 
ble is located first. If the tag in the entry matches the 
accessing address, the branch information in the entry 
is used to predict the branch. If the tag does not match 
the address, a new entry is allocated for the branch. 
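
The lookup just described can be sketched in C as follows. This is a minimal sketch under assumptions, not the simulated hardware: it models a 512-entry, 4-way table, treats `addr` as an instruction word index, tracks LRU order with a timestamp rather than with per-entry LRU bits, and clears the history of a newly allocated entry. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WAYS 4u
#define SETS 128u                  /* 512 entries / 4 ways */

typedef struct {
    int      valid;
    uint32_t tag;                  /* higher part of the branch address */
    uint32_t hr;                   /* the branch's history register */
    uint64_t last_used;            /* timestamp standing in for LRU bits */
} bht_entry;

typedef struct {
    bht_entry e[SETS][WAYS];
    uint64_t  clock;
} bht;

static void bht_init(bht *t)
{
    memset(t, 0, sizeof *t);
}

/* Find the entry for addr; on a miss, allocate one by replacing an invalid
 * way or, failing that, the least recently used way. *hit reports whether
 * the tag matched. */
static bht_entry *bht_access(bht *t, uint32_t addr, int *hit)
{
    bht_entry *set = t->e[addr % SETS];    /* lower part selects the set */
    uint32_t tag = addr / SETS;            /* higher part is the tag */
    bht_entry *victim = &set[0];

    assert(hit != NULL);
    t->clock++;
    for (unsigned w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = t->clock;
            *hit = 1;
            return &set[w];
        }
    }
    for (unsigned w = 1; w < WAYS; w++) {
        if (!victim->valid) break;         /* an invalid way is free */
        if (!set[w].valid || set[w].last_used < victim->last_used)
            victim = &set[w];
    }
    victim->valid = 1;
    victim->tag = tag;
    victim->hr = 0;                        /* assumed reset of the history */
    victim->last_used = t->clock;
    *hit = 0;
    return victim;
}
```

Setting `WAYS` to 1 and `SETS` to the table size gives the direct-mapped organization, in which replacement needs no LRU state at all.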

In this study, both the above practical approach and 
an Ideal Branch History Table (IBHT), in which there 
is a history register for each static conditional branch, 
were simulated for Two-Level Adaptive Branch Predic- 
tion. The branch history table was simulated with four 
configurations: 4-way set-associative 512-entry, 4-way 
set-associative 256-entry, direct-mapped 512-entry and 
direct-mapped 256-entry caches. The IBHT simulation 
data is provided to show the accuracy loss due to the
history interference in practical branch history table
implementations.



3.4 Hardware Cost Estimates 

The chip area required for a run-time branch predic- 
tion mechanism is not inconsequential. The following 
hardware cost estimates are proposed to characterize 
the relative costs of the three variations. The branch 
history table and the pattern history table are the two 
major parts. Detailed items include storage space for 
keeping history information, prediction bits, tags, and 
LRU bits and the accessing and updating logic of the 
tables. The accessing and updating logic consists of 
comparators, MUXes, LRU bits incrementors, and ad- 
dress decoders for the branch history table, and address 
decoders and pattern history bit update circuits for the 
pattern history table. The storage space for caching tar- 
get addresses is not included in the following equations 
because it is not required for the branch predictor. 
Assumptions of these estimates are: 

• There are a address bits, a subset of which is used 
to index the branch history table and the rest are 
stored as a tag in the indexed branch history table 
entry. 

• In an entry of the branch history table, there are 
fields for branch history, an address tag, a predic- 
tion bit, and LRU bits. 

• The branch history table size is h. 

• The branch history table is $2^j$-way set-associative.

• Each history register contains k bits.

• Each pattern history table entry contains s bits.

• Pattern history table set size is p. (In PAp, p is
equal to the size of the branch history table, h, while
in GAg and PAg, p is always equal to one.)

• $C_s$, $C_d$, $C_c$, $C_m$, $C_{sh}$, $C_i$, and $C_a$ are the constant
base costs for the storage, the decoder, the comparator,
the multiplexer, the shifter, the incrementor,
and the finite-state machine.

Furthermore, i is equal to $\log_2 h$ and is a non-negative
integer. When there are k bits in a history register, a
pattern history table always has $2^k$ entries.

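The hardware cost of Two-Level Adaptive Branch
Prediction is as follows:

$$
\begin{aligned}
&Cost_{Scheme}(BHT(h,j,k),\ p \times PHT(2^k,s)) \\
&= Cost_{BHT}(h,j,k) + p \times Cost_{PHT}(2^k,s) \\
&= \{BHT_{Storage\_Space} + BHT_{Accessing\_Logic} + BHT_{Updating\_Logic}\} \\
&\quad + p \times \{PHT_{Storage\_Space} + PHT_{Accessing\_Logic} + PHT_{Updating\_Logic}\} \\
&= \{[h \times (Tag_{(a-i+j)\text{-bit}} + HR_{k\text{-bit}} + Prediction\_Bit_{1\text{-bit}} + LRU\_Bits_{j\text{-bit}})] \\
&\quad + [1 \times Address\_Decoder_{i\text{-bit}} + 2^j \times Comparators_{(a-i+j)\text{-bit}} + 1 \times (2^j \times 1)\ MUX_{k\text{-bit}}] \\
&\quad + [h \times Shifter_{k\text{-bit}} + 2^j \times LRU\_Incrementors_{j\text{-bit}}]\} \\
&\quad + p \times \{[2^k \times History\_Bits_{s\text{-bit}}] + [1 \times Address\_Decoder_{k\text{-bit}}] + [State\_Updater_{s\text{-bit}}]\} \\
&= \{h \times [(a-i+j) + k + 1 + j] \times C_s \\
&\quad + [h \times C_d + 2^j \times (a-i+j) \times C_c + 2^j \times k \times C_m] \\
&\quad + [h \times k \times C_{sh} + 2^j \times j \times C_i]\} \\
&\quad + p \times \{[2^k \times s \times C_s] + [2^k \times C_d] + [s \times 2^{s+1} \times C_a]\}, \qquad a + j > i. \qquad (3)
\end{aligned}
$$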

In GAg, only one history register and one global pat- 
tern history table are used, so h and p are both equal to 
one. No tag and no branch history table accessing logic 
are necessary for the single history register. Besides, 
pattern history state updating logic is small compared 
to the other two terms in the pattern history table cost. 
Therefore, cost estimation function for GAg can be sim- 
plified from Function 3 to the following Function: 

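$$
\begin{aligned}
&Cost_{GAg}(BHT(1,\ ,k),\ 1 \times PHT(2^k,s)) \\
&= Cost_{BHT}(1,\ ,k) + 1 \times Cost_{PHT}(2^k,s) \\
&\approx \{[k+1] \times C_s + k \times C_{sh}\} + \{2^k \times (s \times C_s + C_d)\} \qquad (4)
\end{aligned}
$$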

It is clear to see that the cost of GAg grows exponen- 
tially with respect to the history register length. 

In PAg, only one pattern history table is used, so p 
is equal to one. Since j and s are usually small com- 
pared to the other variables, by using Function 3, the 
estimated cost for PAg using a branch history table is 
as follows: 

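$$
\begin{aligned}
&Cost_{PAg}(BHT(h,j,k),\ 1 \times PHT(2^k,s)) \\
&= Cost_{BHT}(h,j,k) + 1 \times Cost_{PHT}(2^k,s) \\
&\approx \{h \times [(a + 2 \times j + k + 1 - i) \times C_s + C_d + k \times C_{sh}]\} \\
&\quad + \{2^k \times (s \times C_s + C_d)\}, \qquad a + j > i. \qquad (5)
\end{aligned}
$$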

The cost of a PAg scheme grows exponentially with 
respect to the history register length and linearly with 
respect to the branch history table size. 

In a PAp scheme using a branch history table as de- 
fined above, h pattern history tables are used, so p is 
equal to h. By using Function 3, the estimated cost for 
PAp is as follows: 

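$$
\begin{aligned}
&Cost_{PAp}(BHT(h,j,k),\ h \times PHT(2^k,s)) \\
&= Cost_{BHT}(h,j,k) + h \times Cost_{PHT}(2^k,s) \\
&\approx \{h \times [(a + 2 \times j + k + 1 - i) \times C_s + C_d + k \times C_{sh}]\} \\
&\quad + h \times \{2^k \times (s \times C_s + C_d)\}, \qquad a + j > i. \qquad (6)
\end{aligned}
$$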

When the history register is sufficiently large, the cost 
of a PAp scheme grows exponentially with respect to the 
history register length and linearly with respect to the 
branch history table size. However, the branch history 
table size becomes a more dominant factor than it is in 
a PAg scheme. 
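
To get a feel for how the three simplified estimates scale, they can be evaluated numerically. The C sketch below transcribes the simplified cost functions for GAg, PAg, and PAp; the struct `unit_costs`, the function names, and the example parameter values in the note that follows are assumptions chosen for illustration, and the unit base costs are placeholders rather than figures from the paper.

```c
#include <assert.h>

/* Constant base costs for storage, decoder, and shifter (C_s, C_d, C_sh). */
typedef struct {
    double cs, cd, csh;
} unit_costs;

static double pow2(unsigned n)
{
    assert(n < 63);
    return (double)(1ull << n);
}

/* One pattern history table: 2^k entries of s bits plus its decoder. */
static double cost_pht(unsigned k, unsigned s, unit_costs c)
{
    return pow2(k) * (s * c.cs + c.cd);
}

/* Simplified branch history table cost shared by PAg and PAp. */
static double cost_bht(unsigned a, unsigned h, unsigned i, unsigned j,
                       unsigned k, unit_costs c)
{
    assert(a + j > i);
    return h * ((a + 2 * j + k + 1 - i) * c.cs + c.cd + k * c.csh);
}

/* GAg: one history register and one global pattern history table. */
static double cost_gag(unsigned k, unsigned s, unit_costs c)
{
    return (k + 1) * c.cs + k * c.csh + cost_pht(k, s, c);
}

/* PAg: a branch history table and one global pattern history table. */
static double cost_pag(unsigned a, unsigned h, unsigned i, unsigned j,
                       unsigned k, unsigned s, unit_costs c)
{
    return cost_bht(a, h, i, j, k, c) + cost_pht(k, s, c);
}

/* PAp: a branch history table and h pattern history tables. */
static double cost_pap(unsigned a, unsigned h, unsigned i, unsigned j,
                       unsigned k, unsigned s, unit_costs c)
{
    return cost_bht(a, h, i, j, k, c) + h * cost_pht(k, s, c);
}
```

With all base costs set to 1 and the hypothetical parameters a = 30, h = 512, i = 9, j = 2, k = 12, s = 2, the PAp estimate is dominated by the h pattern history tables, which is the effect of branch history table size noted above.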

4 Simulation Model 

Trace-driven simulations were used in this study. A Motorola
88100 instruction level simulator is used for generating
instruction traces. The instruction and address
traces are fed into the branch prediction simulator which 
decodes instructions, predicts branches, and verifies the 
predictions with the branch results to collect statistics 
for branch prediction accuracy. 



4.1 Description of Traces 

Nine benchmarks from the SPEC benchmark suite are 
used in this branch prediction study. Five are float- 
ing point benchmarks and four are integer benchmarks. 
The floating point benchmarks include doduc, fpppp,
matrix300, spice2g6 and tomcatv and the integer ones
include eqntott, espresso, gcc, and li. Nasa7 is not
included because it takes too long to capture the branch
behavior of all seven kernels.

Among the five floating point benchmarks, fpppp,
matrix300 and tomcatv have repetitive loop execution;
thus, a very high prediction accuracy is attainable, in- 
dependent of the predictors used. Doduc, spice2g6 and 
the integer benchmarks are more interesting. They have 
many conditional branches and irregular branch behav- 
ior. Therefore, it is on the integer benchmarks where a 
branch predictor's mettle is tested. 

Since this study of branch prediction focuses on the
prediction for conditional branches, all benchmarks
were simulated for twenty million conditional branch
instructions except gcc, which finished before twenty
million conditional branch instructions were executed.
Fpppp, matrix300, and tomcatv were simulated for 100
million instructions because of their regular branch
behavior throughout the programs. The number of static
conditional branches in the instruction trace of each
benchmark is listed in Table 1. History register hit
rate usually depends on the number of static branches
in the benchmarks. The testing and training data sets
for each benchmark used in this study are listed in
Table 2.



Benchmark   Number of Static   Benchmark   Number of Static
Name        Cnd. Br.           Name        Cnd. Br.
---------   ----------------   ---------   ----------------
eqntott     277                espresso    556
gcc         6922               li          489
doduc       1149               fpppp       653
matrix300   213                spice2g6    606
tomcatv     370







Table 1: Number of static conditional branches in each 
benchmark. 



Benchmark   Training            Testing
Name        Data Set            Data Set
---------   -----------------   -------------
eqntott     NA                  int_pri_3.eqn
espresso    cps                 bca
gcc         cexp.i              dbxout.i
xlisp       tower of hanoi      eight queens
doduc       tiny doducin        doducin
fpppp       NA                  natoms
matrix300   NA                  Built-in
spice2g6    short greycode.in   greycode.in
tomcatv     NA                  Built-in



Table 2: Training and testing data sets of benchmarks.

In the traces generated with the testing data sets,
about 24 percent of the dynamic instructions for the
integer benchmarks and about 5 percent of the dynamic
instructions for the floating point benchmarks
are branch instructions. Figure 4 shows that about 80
percent of the dynamic branch instructions are conditional
branches; therefore, the prediction mechanism for
conditional branches is the most important among the
prediction mechanisms for different classes of branches.

Figure 4: Distribution of dynamic branch instructions.



4.2 Characterization of Branch Predictors 

The three variations of Two-Level Adaptive Branch
Prediction were simulated with several configurations.
Other known dynamic and static branch predictors were
also simulated. The configurations of the dynamic
branch predictors are shown in Table 3. In order to
distinguish the different schemes we analyzed, the
following naming convention is used:
Scheme(History(Size, Associativity, Entry_Content),
Pattern_Table_Set_Size x Pattern(Size, Entry_Content),
Context_Switch). If a predictor does not have a certain
feature in the naming convention, the corresponding
field is left blank.

Scheme specifies the scheme, for example, GAg,
PAg, PAp or Branch Target Buffer design (BTB)
[17]. In History(Size, Associativity, Entry_Content),
History is the entity used to keep history information
of branches, for example, HR (a single history register),
IBHT, or BHT. Size specifies the number of entries in
that entity, Associativity is the associativity of the
table, and Entry_Content specifies the content in each
branch history table entry. When Associativity is set
to 1, the branch history table is direct-mapped. The
content of an entry in the branch history table can be
any automaton shown in Figure 2 or simply a history
register.

In Pattern_Table_Set_Size x Pattern(Size,
Entry_Content), Pattern_Table_Set_Size is the
number of pattern history tables used in the scheme,
Pattern is the implementation for keeping pattern
history information, Size specifies the number of entries in
the implementation, and Entry_Content specifies the
content in each entry. The content of an entry in the
pattern history table can be any automaton shown in
Figure 2. For Branch Target Buffer designs, the Pattern
part is not included, because there is no pattern history
information kept in their designs. Context_Switch is
a flag for context switches. When Context_Switch is
specified as c, context switches are simulated. If it is
not specified, no context switches are simulated.
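
To make the naming convention concrete, the sketch below assembles a model name from its fields. It is only an illustration: the class and function names (History, Pattern, model_name) are hypothetical, and fields are kept as strings so that blank fields and symbolic sizes such as 2^12 can be written as they appear in the convention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class History:
    entity: str               # "HR", "BHT", or "IBHT"
    size: str                 # number of entries, e.g. "512"
    associativity: str = ""   # left blank when the field does not apply
    entry_content: str = ""   # e.g. "12-sr" for a 12-bit shift register


@dataclass
class Pattern:
    set_size: str             # number of pattern history tables
    size: str                 # entries per table, e.g. "2^12"
    entry_content: str        # e.g. "A2" or "LT"


def model_name(scheme: str, history: History,
               pattern: Optional[Pattern] = None,
               context_switch: bool = False) -> str:
    """Build Scheme(History(...), N x Pattern(...), Context_Switch)."""
    hist = (f"{history.entity}({history.size},"
            f"{history.associativity},{history.entry_content})")
    # The Pattern part is left blank for Branch Target Buffer designs.
    pat = "" if pattern is None else (
        f"{pattern.set_size}xPHT({pattern.size},{pattern.entry_content})")
    ctx = "c" if context_switch else ""
    return f"{scheme}({hist},{pat},{ctx})"


if __name__ == "__main__":
    print(model_name("PAg", History("BHT", "512", "4", "12-sr"),
                     Pattern("1", "2^12", "A2"), context_switch=True))
    print(model_name("BTB", History("BHT", "512", "4", "A2")))
```

A field that does not apply is simply emitted as an empty string, which mirrors the rule that the corresponding field is left blank.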

Since there are more taken branches than not taken 
branches according to our simulation results, a history 
register in the branch history table is initialized to all 1's
when a miss on the branch history table occurs. After 
the result of the branch which causes the branch history 
table miss is known, the result bit is extended through- 
out the history register. A context switch results in 
flushing and reinitialization of the branch history table. 
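
The initialization policy above can be sketched as follows. The function names are hypothetical, and the shift direction in shift_in (newest outcome in the least significant bit) is an assumption made for illustration; the text does not fix a bit order.

```python
def history_on_miss(k: int) -> int:
    """History register allocated on a branch history table miss: k ones."""
    return (1 << k) - 1


def history_after_first_result(k: int, taken: bool) -> int:
    """Once the missing branch resolves, its result bit is extended
    throughout the register: all 1's if taken, all 0's if not taken."""
    return (1 << k) - 1 if taken else 0


def shift_in(history: int, k: int, taken: bool) -> int:
    """Normal update: shift the newest outcome into a k-bit register
    (assumed bit order: newest outcome in the least significant bit)."""
    return ((history << 1) | int(taken)) & ((1 << k) - 1)


if __name__ == "__main__":
    k = 4
    h = history_on_miss(k)                           # register while the branch is in flight
    h = history_after_first_result(k, taken=False)   # branch resolved not taken
    h = shift_in(h, k, taken=True)                   # next execution is taken
    print(format(h, f"0{k}b"))
```

A context switch, as described above, would discard every such register and restart from history_on_miss for each branch that is seen again.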



Model Name                                        BHT Config.           PHT Config.
                                                  # of   Asc  Entry     # of   Entry
                                                  Entr.       Cont.     Entr.  Cont.
-----------------------------------------------   -----  ---  ------    -----  ------
GAg(HR(1, , r-sr), 1xPHT(2^r, A2), [c])           1           r-sr      2^r    Atm A2
PAg(BHT(256, 1, r-sr), 1xPHT(2^r, A2), [c])       256    1    r-sr      2^r    Atm A2
PAg(BHT(256, 4, r-sr), 1xPHT(2^r, A2), [c])       256    4    r-sr      2^r    Atm A2
PAg(BHT(512, 1, r-sr), 1xPHT(2^r, A2), [c])       512    1    r-sr      2^r    Atm A2
PAg(BHT(512, 4, r-sr), 1xPHT(2^r, A1), [c])       512    4    r-sr      2^r    Atm A1
PAg(BHT(512, 4, r-sr), 1xPHT(2^r, A2), [c])       512    4    r-sr      2^r    Atm A2
PAg(BHT(512, 4, r-sr), 1xPHT(2^r, A3), [c])       512    4    r-sr      2^r    Atm A3
PAg(BHT(512, 4, r-sr), 1xPHT(2^r, A4), [c])       512    4    r-sr      2^r    Atm A4
PAg(BHT(512, 4, r-sr), 1xPHT(2^r, LT), [c])       512    4    r-sr      2^r    LT
PAg(IBHT(inf, , r-sr), 1xPHT(2^r, A2), [c])       inf         r-sr      2^r    Atm A2
PAp(BHT(512, 4, r-sr), 512xPHT(2^r, A2), [c])     512    4    r-sr      2^r    Atm A2
BTB(BHT(512, 4, A2), , [c])                       512    4    Atm A2

Asc - Table Set-Associativity, Atm - Automaton, BHT - Branch
History Table, BTB - Branch Target Buffer Design, Config. -
Configuration, Entr. - Entries, GAg - Global Two-Level Adaptive
Branch Prediction Using a Global Pattern History Table, GSg -
Global Static Training Using a Preset Global Pattern History
Table, IBHT - Ideal Branch History Table, inf - Infinite, LT -
Last-Time, PAg - Per-address Two-Level Adaptive Branch Prediction
Using a Global Pattern History Table, PAp - Per-address Two-Level
Adaptive Branch Prediction Using Per-address Pattern History
Tables, PB - Preset Prediction Bit, PSg - Per-address Static
Training Using a Preset Global Pattern History Table, PHT -
Pattern History Table, sr - Shift Register.

Table 3: Configurations of simulated branch predictors. 

The pattern history bits in the pattern history table
entries are also initialized at the beginning of execution.
Since taken branches are more likely, for those pattern
history tables using automata A1, A2, A3, and A4, all
entries are initialized to state 3. For Last-Time, all
entries are initialized to state 1 such that the branches at
the beginning of execution will be more likely to be
predicted taken. It is not necessary to reinitialize pattern
history tables during execution.
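
The initial states above can be illustrated with two small automata. This is a sketch under stated assumptions: the four-state automaton is modeled as a 2-bit saturating up-down counter (used here as a stand-in for A2; the exact transition diagrams are in Figure 2 and are not reproduced), and the rule "predict taken when the state is 2 or 3" is likewise assumed. The class names are hypothetical.

```python
class SaturatingCounter:
    """2-bit saturating up-down counter, a stand-in for a four-state automaton."""

    def __init__(self, state: int = 3):   # initialized to state 3
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2            # assumed: upper two states predict taken

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class LastTime:
    """Records only what happened the last time the pattern was seen."""

    def __init__(self, state: int = 1):   # initialized to state 1
        self.state = state

    def predict(self) -> bool:
        return self.state == 1

    def update(self, taken: bool) -> None:
        self.state = int(taken)


if __name__ == "__main__":
    counter, last = SaturatingCounter(), LastTime()
    for outcome in (False, True):         # one deviation, then back to taken
        counter.update(outcome)
        last.update(outcome)
        print(counter.predict(), last.predict())
```

After a single not-taken outcome the counter still predicts taken while Last-Time flips its prediction, which is one way to see why the four-state automata tolerate isolated deviations better.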

In addition to the Two-Level Adaptive schemes, Lee
and A. Smith's Static Training schemes, Branch Target
Buffer designs, and some dynamic and static branch
prediction schemes were simulated for comparison
purposes. Lee and A. Smith's Static Training scheme is
similar in structure to the Per-address Two-Level Adaptive
scheme with an IBHT but with the important difference
that the prediction for a given pattern is pre-determined
by profiling. In this study, Lee and A. Smith's Static
Training is identified as PSg, meaning Per-address Static
Training using a preset global pattern history table.
Similarly, the scheme which has a similar structure to
GAg but with the difference that the second-level pattern
history information is collected from profiling is
abbreviated GSg, meaning Global Static Training using
a preset global pattern history table. Per-address Static
Training using per-address pattern history tables (PSp)
is another application of Static Training to a different
structure; however, this scheme requires a lot of storage
to keep track of the pattern behavior of all branches
statically. Therefore, no PSp schemes were simulated in this
study. Lee and A. Smith's Static Training schemes were
simulated with the same branch history table configurations
as used by the Two-Level Adaptive schemes for
a fair comparison. Static Training is no less expensive
to implement than the Two-Level Adaptive scheme because
the branch history table and the pattern history table
required by both schemes are similar. In Static Training,
before program execution starts, extra time is needed to
load the preset pattern prediction bits into the pattern
history table.

Branch Target Buffer designs were simulated with 
automata A2 and Last-Time. The static branch pre- 
diction schemes simulated include the Always Taken, 
Backward Taken and Forward Not Taken, and a profiling
scheme. The Always Taken scheme predicts taken for
all branches. The Backward Taken and Forward Not Taken
(BTFN) scheme predicts taken if a branch jumps
backward and not taken if it jumps forward.
The BTFN scheme is effective for loop-bound
programs, because it mispredicts only once in the exe- 
cution of a loop. The profiling scheme counts the fre- 
quency of taken and not-taken for each static branch 
in the profiling execution. The predicted direction of 
a branch is the one the branch takes most frequently. 
The profiling information of a program executed with a 
training data set is used for branch predictions for the 
program executed with testing data sets, thus calculat- 
ing the prediction accuracy. 
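
The three static schemes can be sketched in a few lines. The trace format, a sequence of (branch address, taken) pairs, and all function names are assumptions for illustration. Two details the text leaves open are also assumed here: ties in the profile are resolved as taken, and branches absent from the training run are predicted taken.

```python
from collections import defaultdict


def always_taken(branch_pc: int, target_pc: int) -> bool:
    """Always Taken: predict taken for every branch."""
    return True


def btfn(branch_pc: int, target_pc: int) -> bool:
    """BTFN: taken if the target lies backward, not taken if forward."""
    return target_pc < branch_pc


def train_profile(trace):
    """Count taken and not-taken outcomes per static branch in a training
    trace of (branch_pc, taken) pairs; keep the more frequent direction."""
    counts = defaultdict(lambda: [0, 0])   # pc -> [not-taken count, taken count]
    for pc, taken in trace:
        counts[pc][int(taken)] += 1
    return {pc: c[1] >= c[0] for pc, c in counts.items()}   # tie -> taken (assumed)


def profile_accuracy(profile, test_trace, default=True):
    """Fraction of test-trace branches predicted correctly by the profile.
    Branches unseen in training fall back to `default` (an assumption)."""
    correct = sum(profile.get(pc, default) == taken for pc, taken in test_trace)
    return correct / len(test_trace)


if __name__ == "__main__":
    training = [(0x10, True), (0x10, True), (0x10, False), (0x20, False)]
    testing = [(0x10, True), (0x10, False), (0x20, False), (0x30, True)]
    profile = train_profile(training)
    print(profile_accuracy(profile, testing))
```

Keeping the training and testing traces separate mirrors the evaluation described above, where the profile comes from the training data set and accuracy is measured on the testing data set.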



5 Branch Prediction Simulation Results

Figures 5 through 11 show the prediction accuracy of
the branch predictors described in the previous section
on the nine SPEC benchmarks. "Tot GMean" is the
geometric mean across all the benchmarks, "Int GMean"
is the geometric mean across all the integer benchmarks,
and "FP GMean" is the geometric mean across all the
floating point benchmarks. The vertical axis shows the
prediction accuracy scaled from 76 percent to 100 percent.

5.1 Evaluation of the Parameters of Two-Level
Adaptive Branch Prediction

The three variations of Two-Level Adaptive Branch
Prediction were simulated with different history register
lengths to assess the effectiveness of increasing the
recorded history length. The PAg and PAp schemes
were each simulated with an ideal branch history table
(IBHT) and with practical branch history tables to
show the effect of the branch history table hit ratio.

5.1.1 Effect of Pattern History Table Automaton

Figure 5 shows the efficiency of using different
finite-state automata. Five automata, A1, A2, A3, A4, and
Last-Time, were simulated with a PAg branch predictor
having 12-bit history registers in a four-way
set-associative 512-entry BHT. A1, A2, A3, and A4 all
perform better than Last-Time. The four-state automata
A1, A2, A3, and A4 maintain more history information
than Last-Time, which only records what happened the
last time; they are therefore more tolerant of deviations
in the execution history. Among the four-state
automata, A1 performs worse than the others. The
performance of A2, A3, and A4 is very close;
however, A2 usually performs best. In order to show
the following figures clearly, each Two-Level Adaptive
scheme is shown with automaton A2.

Figure 5: Comparison of Two-Level Adaptive Branch
Predictors using different finite-state automata.


5.1.2 Effect of History Register Length 

Three variations using history registers of the
same length

Figure 6 shows the effects of history register length on
the prediction accuracy of Two-Level Adaptive schemes.
Every scheme in the graph was simulated with the same
history register length. Among the variations, PAp performs
the best, PAg the second, and GAg the worst.
GAg is not effective with 6-bit history registers, because
every branch updates the same history register, causing
excessive interference. PAg performs better than GAg,
because it has a branch history table which reduces the
interference in branch history. PAp predicts the best,
because the interference in the pattern history is
removed.







Figure 6: Comparison of the Two-Level Adaptive 
schemes using history registers of the same length. 
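
The sources of interference in the three variations come down to which tables are shared. The sketch below is a simplified model rather than the simulator used in this study: it assumes unbounded (ideal) tables, models each pattern history entry as a 2-bit saturating counter initialized to state 3, initializes history registers to all 1's, and omits context switches. The class name and trace format are hypothetical.

```python
from collections import defaultdict


class TwoLevel:
    """Simplified two-level predictor; variant is 'GAg', 'PAg', or 'PAp'."""

    def __init__(self, variant: str, k: int):
        self.variant = variant
        self.mask = (1 << k) - 1
        # First level: k-bit history registers, initialized to all 1's.
        self.hist = defaultdict(lambda: self.mask)
        # Second level: 2-bit counters, initialized to state 3.
        self.pht = defaultdict(lambda: 3)

    def _keys(self, pc: int):
        # GAg shares one history register; PAg and PAp keep one per branch.
        hkey = 0 if self.variant == "GAg" else pc
        pattern = self.hist[hkey]
        # PAp keeps a pattern table per branch; GAg and PAg share one table.
        pkey = (pc, pattern) if self.variant == "PAp" else (0, pattern)
        return hkey, pkey

    def predict(self, pc: int) -> bool:
        _, pkey = self._keys(pc)
        return self.pht[pkey] >= 2

    def update(self, pc: int, taken: bool) -> None:
        hkey, pkey = self._keys(pc)
        state = self.pht[pkey]
        self.pht[pkey] = min(3, state + 1) if taken else max(0, state - 1)
        self.hist[hkey] = ((self.hist[hkey] << 1) | int(taken)) & self.mask


def accuracy(predictor: TwoLevel, trace) -> float:
    """Predict, then update, for each (pc, taken) pair in the trace."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


if __name__ == "__main__":
    alternating = [(1, n % 2 == 0) for n in range(100)]   # T, N, T, N, ...
    for variant in ("GAg", "PAg", "PAp"):
        print(variant, accuracy(TwoLevel(variant, 2), alternating))
```

In this model GAg lets every branch write the same history register, PAg removes that sharing at the first level only, and PAp removes it at both levels, which matches the ordering of interference described above.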

Effects of various history register lengths 
To further investigate the effect of history register
length, Figure 7 shows the accuracy of GAg with various
history register lengths. Accuracy increases by 9
percent when the history register is lengthened
from 6 bits to 18 bits. The effect of history register
length is obvious on GAg schemes. The history register
length has a smaller effect on PAg schemes and an even
smaller effect on PAp schemes, because these schemes
suffer less interference in the branch history and pattern
history and are already effective with short history registers.




Figure 7: Effect of various history register lengths on 
GAg schemes. 



5.1.3 Hardware Cost Efficiency of Three Vari- 
ations 

In Figure 6, the prediction accuracies of the schemes with
the same history register length were compared. However,
the various Two-Level Adaptive schemes have different
costs. PAp is the most expensive, PAg the second,
and GAg the least, as expected. When evaluating
the three variations of Two-Level Adaptive Branch
Prediction, it is useful to know which variation is the
least expensive when they predict with approximately
the same accuracy.

Figure 8 illustrates three schemes which achieve about
97 percent prediction accuracy. One scheme is chosen
for each variation to show the variation's configuration
requirements to obtain that prediction accuracy. To
achieve 97 percent prediction accuracy, GAg requires an
18-bit history register, PAg requires 12-bit history
registers, and PAp requires 6-bit history registers. According
to our cost estimates, PAg is the cheapest among these
three. GAg's pattern history table is expensive when a
long history register is used. PAp is expensive due to
the required multiple pattern history tables.

Figure 8: The Two-Level Adaptive schemes achieve
about 97 percent prediction accuracy.

5.1.4 Effect of Context Switch 

Since Two-Level Adaptive Branch Prediction uses the
branch history table to keep track of branch history, the
table needs to be flushed during a context switch. Figure
9 shows the difference in the prediction accuracy
for three schemes simulated with and without context
switches. During the simulation, a context switch is
simulated whenever a trap occurs in the instruction trace,
or every 500,000 instructions if no trap occurs.
After a context switch, the pattern history table is not
re-initialized, because the pattern history table of the
saved process is more likely to be similar to the current
process's pattern history table than to a re-initialized
pattern history table. The value 500,000 is derived
by assuming that a 50 MHz clock is used and context
switches occur every 10 ms in a 1 IPC machine. The
average accuracy degradations for the three schemes are
all less than 1 percent. The accuracy degradations for
gcc when PAg and PAp are used are much greater than
those of the other programs because of the large number
of traps in gcc. However, the excessive number of
traps does not degrade the prediction accuracy of the GAg
scheme, because an initialized global history register can
be refilled quickly. The prediction accuracy of fpppp
using GAg actually increases when context switches are
simulated. There are very few conditional branches in
fpppp and all the conditional branches have regular
behavior; therefore, initializing the global history register
helps clear out the noise.

Figure 9: Effect of context switch on prediction accuracy.
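
The context-switch interval of 500,000 instructions follows directly from the stated assumptions (a 50 MHz clock, a context switch every 10 ms, and 1 instruction per cycle). Written out as a worked equation:

```latex
\[
  \underbrace{50 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}}}_{\text{clock rate}}
  \times
  \underbrace{10 \times 10^{-3}\ \text{s}}_{\text{switch period}}
  \times
  \underbrace{1\ \tfrac{\text{instruction}}{\text{cycle}}}_{\text{IPC}}
  = 500{,}000\ \text{instructions}
\]
```

A faster clock or a higher IPC would lengthen the interval in proportion, so context switches would flush the branch history table less often per instruction executed.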

5.1.5 Effect of Branch History Table Imple- 
mentation 

Figure 10 illustrates the effects of the size and associa- 
tivity of the branch history table in the presence of con- 
text switches. Four practical branch history table imple- 
mentations and an ideal branch history table were sim- 
ulated. The four-way set-associative 512-entry branch 
history table's performance is very close to that of the 
ideal branch history table, because most branches in the 
programs can fit in the table. Prediction accuracy de- 
creases as table miss rate increases, which is also seen 
in the PAp schemes. 

5.2 Comparison of Two-Level Adaptive Branch
Prediction and Other Prediction Schemes

Figure 11 compares the branch prediction schemes. The 
PAg scheme which achieves 97 percent prediction ac- 
curacy is chosen for comparison with other well-known 
schemes, because it costs the least among the three vari- 
ations of Two-Level Adaptive Branch Prediction. 

The 4-way set-associative 512-entry BHT is selected 
to be used by all schemes which keep the first-level 
branch history information, because it is simple enough 
to be implemented. The Two-Level Adaptive scheme 
and the Static Training scheme were chosen on the ba- 
sis of similar costs. 

Figure 10: Effect of branch history table implementation
on PAg schemes.

The top curve is achieved by the Two-Level Adaptive
scheme whose prediction accuracy is about 97 percent.
Since the data for the Static Training schemes are not
complete due to the unavailability of appropriate data
sets, the data points for eqntott, fpppp, matrix300, and
tomcatv are not graphed. PSg is about 1 to 4 percent
lower than the top curve for the benchmarks that are
available and GSg is about 4 to 19 percent lower, with
average prediction accuracies of 94.4 percent and 89
percent respectively. Note that their accuracy depends
greatly on the similarities between the data sets used for
training and testing. The prediction accuracy for the branch
target buffer using 2-bit saturating up-down counters
[17] is around 93 percent. The Profiling scheme achieves
about 91 percent prediction accuracy. The branch target
buffer using Last-Time achieves about 89 percent
prediction accuracy. Most of the prediction accuracy
curves of BTFN and Always Taken are below the baseline
(76 percent). BTFN's average prediction accuracy
is about 68.5 percent and Always Taken's is about 62.5
percent. In this figure, the Two-Level Adaptive scheme
is superior to the other schemes by at least 2.6 percent.

Figure 11: Comparison of branch prediction schemes.

6 Concluding Remarks

In this paper we have proposed a new dynamic branch
predictor (Two-Level Adaptive Branch Prediction) that
achieves substantially higher accuracy than any other
scheme that we are aware of. We computed the hardware
costs of implementing three variations of this
scheme and determined that the most effective
implementation of Two-Level Adaptive Branch Prediction
utilizes a per-address branch history table and a global
pattern history table.

We have measured the prediction accuracy of the
three variations of Two-Level Adaptive Branch
Prediction and several other popular proposed dynamic
and static prediction schemes using trace-driven
simulation of nine of the ten SPEC benchmarks. We have
shown that the average prediction accuracy for Two-Level
Adaptive Branch Prediction is about 97 percent,
while the other known schemes achieve at most 94.4
percent average prediction accuracy.

We have measured the effects of varying the parameters
of the Two-Level Adaptive predictors. We noted
the sensitivity to k, the length of the history register,
and s, the size of each entry in the pattern history table.
We reported on the effectiveness of the various
prediction algorithms that use the pattern history table
information. We showed the effects of context switching.

Finally, we should point out that we feel our 97 per- 
cent prediction accuracy figures are not good enough 
and that future research in branch prediction is still 
needed. High performance computing engines in the 
future will increase the issue rate and the depth of 
the pipeline, which will combine to increase further the 
amount of speculative work that will have to be thrown 
out due to a branch prediction miss. Thus, the 3 per- 
cent prediction miss rate needs improvement. We are 
examining that 3 percent to try to characterize it and
hopefully reduce it.

Abstract

Modern high-performance architectures require extremely accurate 
branch prediction to overcome the performance limitations of con- 
ditional branches. We present a framework that categorizes branch 
prediction schemes by the way in which they partition dynamic 
branches and by the kind of predictor that they use. The framework 
allows us to compare and contrast branch prediction schemes, and 
to analyze why they work. We use the framework to show how a 
static correlated branch prediction scheme increases branch bias 
and thus improves overall branch prediction accuracy. We also use 
the framework to identify the fundamental differences between 
static and dynamic correlated branch prediction schemes. This 
study shows that there is room to improve the prediction accuracy 
of existing branch prediction schemes. 

Keywords: branch prediction, branch correlation, branch stream 
characteristics. 

1 Introduction 

Recent work in branch prediction has led to the development of 
both hardware and software schemes that achieve good prediction 
accuracy by exploiting branch correlation [4, 9, 11, 14, 15, 16, 17].
However, little attention has been paid to why these schemes 
behave better than prior ones and to where further improvements 
can be made. In this paper, we describe an analytic framework that 
helps answer these questions based on the fundamental character- 
istics of the branch prediction problem. In addition, we use the 
observations based upon this framework to indicate potentially- 
fruitful research directions that will allow computer architects to 
improve branch prediction accuracy. Further improvements in 
branch prediction accuracy will enhance the effectiveness of global 
instruction schedulers and the performance of multiple-instruction- 
issue machines. 

Branch prediction addresses two basic problems: predicting the 
direction of conditional branches, and quickly fetching instructions 
from the predicted target. These problems can be addressed sepa- 
rately, and in this paper, we limit ourselves to the former. In other 
words, we consider a branch prediction scheme to be a technique 
for improving performance by anticipating the outcome of
conditional branches. Other work has shown how to couple a branch
prediction scheme with a branch target buffer to eliminate the
performance penalties of branches [7].

Why branch prediction schemes perform differently is just as
important as how well they perform. Only after explaining why a 
scheme works can one understand appropriate ways to improve or 
alter it. Recent work by McFarling [9] and by Chang et al. [4] uses 
analysis, reasoning, and experimentation to devise better hardware 
schemes for correlated branch prediction. In particular, McFarling 
[9] noticed significant redundancy in the two-level index of the 
correlation-based branch prediction scheme proposed by Pan, So, 
and Rahmeh [11]. By hashing the branch history with the branch 
address, McFarling's gshare scheme often improves prediction 
accuracy under the constraint of a fixed-size table of predictors. 
Similarly, Chang et al. [4] noticed that, for a fixed-size table of pre- 
dictors, branches biased to one particular branch direction more 
than 95% of the time exhibited better prediction accuracies on a 
two-level adaptive scheme [14] when one decreased the branch 
history length, while the rest of the branches exhibited better pre- 
diction accuracies when one increased the branch history length. 
This observation led them to propose several new hybrid branch 
prediction schemes with better overall prediction accuracies. 

Still, it is more difficult to understand the actual workings of 
today's branch prediction schemes than it needs to be. To make it 
easier to develop optimizations such as those proposed by McFar- 
ling [9] and Chang et al. [4], we present a unifying framework that 
allows one to analyze and categorize branch prediction schemes. 
Because the framework is based on a theoretical model of the 
branch prediction problem, it is general enough to encompass all 
branch prediction schemes proposed to date. The framework 
focuses attention on how a prediction scheme assigns the dynamic 
branches of the program to individual predictors. This information 
then directs our analysis of and our search for weaknesses in a par- 
ticular scheme, and allows us to isolate and compare different fac- 
tors that affect prediction accuracy. In particular, we explore the 
fundamental differences between hardware- and software-based 
branch prediction schemes that exploit branch correlation. This 
analysis suggests several ways to improve the overall prediction 
accuracy of today's branch prediction schemes. 

Section 2 describes our framework for classifying and analyzing 
branch prediction schemes. To demonstrate the generality of our 
framework, Section 2 presents many of today's popular branch 
prediction schemes in framework terms. In Section 3, we use the 
framework to explore when (and thus why) static
schemes for correlated branch prediction work. Section 4 goes on 
to compare the differences between static and dynamic schemes 
for correlated branch prediction. As an example of the power of 
our approach, we also describe changes to correlation-based static 
and dynamic prediction schemes that improve their overall
prediction accuracy. Section 5 summarizes the findings of this work.

2 A Framework for Branch Prediction

Given a conditional branch in a program, the goal of a branch pre- 
diction scheme is to predict accurately the outcome of that condi- 
tional branch (i.e. that the branch will take or that the branch will 
fall through). 1 The most accurate branch prediction schemes pre- 
dict the next action of a branch based on some function of the past 
actions of that branch and possibly other branches in the program. 
To understand the capabilities of these branch prediction schemes 
and to compare competing schemes in a meaningful manner, we 
must be able to identify and quantify the important properties of 
branch prediction schemes. To achieve this goal, this section
defines a set of mathematical tools that allow us to analyze pro- 
gram and branch behavior in an abstract manner. 

2.1 Basic Definitions and Goals 

Let a branch execution $e = (b, d)$, $e \in Z \times \{0, 1\}$, be a pair
consisting of an identifier $b \in Z$ and a direction variable
$d \in \{0, 1\}$. Intuitively, the identifier uniquely specifies a static
branch in a program, and the direction variable indicates the direc- 
tion that the branch went. We define an execution stream or just 
stream as a sequence of branch executions. Intuitively, this corre- 
sponds to a branch trace of one invocation of a program, identify- 
ing in trace order the conditional branches executed and the 
directions that they went. A stream can also be formed by concate- 
nating the streams of multiple invocations of a program (possibly 
with different inputs). We refer to the original stream of all execu- 
tions in a run of the program as the program execution stream. A 
substream of a stream $s$ is a subsequence of $s$.

A predictor is a simple mechanism that predicts the next direction 
of a stream. A predictor may consider program characteristics (e.g. 
the opcode of the next branch to predict) in addition to any part of 
the past program execution stream. 2 The accuracy of a predictor is 
the number of correct predictions divided by the total number of 
predictions; accuracy measures how closely the predicted stream 
matches the actual stream. 

A prediction scheme is a comprehensive mechanism that takes a 
program execution stream, divides it into substreams, and directs 
each substream to a unique predictor. Figure 1 illustrates this con- 
cept. The objective in dividing the execution stream into sub- 
streams is that each substream should be more accurately 
predictable by its predictor. The accuracy of the prediction scheme 
is the total number of correct predictions divided by the total num- 
ber of predictions. 



[Figure 1: program execution stream -> divider mechanism -> substreams -> predictors]

Figure 1. Framework for describing any prediction scheme. The
divider mechanism splits the program execution stream into
substreams, each of which is predicted by a single predictor.
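
The definitions above translate almost directly into code. The following is a minimal sketch, not an implementation from this paper: the names run_scheme and LastDirection are hypothetical, a stream is assumed to be a list of (b, d) pairs, and a divider is assumed to be a function of the stream and the current position that returns a substream key using only past executions and the current identifier.

```python
class LastDirection:
    """A deliberately simple predictor: repeat the last direction seen."""

    def __init__(self):
        self.last = 1                      # assumed initial prediction: taken

    def predict(self) -> int:
        return self.last

    def update(self, d: int) -> None:
        self.last = d


def run_scheme(stream, divider, make_predictor) -> float:
    """Direct each execution to the predictor of its substream and return
    the scheme accuracy: correct predictions / total predictions."""
    predictors = {}
    correct = 0
    for i, (b, d) in enumerate(stream):
        key = divider(stream, i)           # which substream this execution joins
        if key not in predictors:
            predictors[key] = make_predictor()
        predictor = predictors[key]
        correct += predictor.predict() == d
        predictor.update(d)
    return correct / len(stream)


if __name__ == "__main__":
    stream = [(1, 1), (2, 0)] * 4          # branch 1 always taken, branch 2 never
    single_substream = lambda s, i: 0      # every execution shares one predictor
    print(run_scheme(stream, single_substream, LastDirection))
```

Swapping in a different divider function changes only how executions are grouped; the predictors and the accuracy computation stay the same, which is the separation the framework is meant to expose.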



1. As a point of interest, the goal of a branch prediction scheme is slightly different
than the goal of the computer architect. A computer architect's goal is to find a
branch prediction scheme that provides the best performance (at possibly the
smallest cost), and this may not be the scheme with the best prediction accuracy.

2. Here, we mean past program execution stream in the most general sense so that we
can consider branch executions from previous runs of the program (as are required
for a profile-based predictor).



2.2 Dividing Streams 

Based on our formal definition of a prediction scheme, the key to 
building a more accurate prediction scheme involves the selection 
of the "right" divider and "good" predictors. In this subsection, we 
review several current methods for dividing a stream, and we dis- 
cuss the intuition behind these approaches. Once we have 
described the important properties of streams that relate to the 
problem of branch prediction, we then discuss existing predictors 
and their important characteristics. 

Existing schemes divide the program execution stream in a variety 
of interesting ways. In the simplest case, the divider is the identity 
function; the program execution stream is fed to a single predictor. 
The prediction scheme that statically predicts all branches taken 
[12] and the prediction scheme that uses a single 2-bit saturating 
up/down counter for all branches [7] are both examples of the 
identity divider function. 
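
As a minimal sketch of the framework (our illustration, not code from the study), a prediction scheme can be modeled as a divider function plus one predictor per substream. The names `simulate`, `identity_divider`, and `AlwaysTaken`, and the four-execution trace, are assumptions made for this example only.

```python
from collections import defaultdict

def simulate(stream, divider, make_predictor):
    """Return the prediction accuracy of a (divider, predictor) scheme.

    stream: list of (branch_id, direction) pairs; 1 = taken, 0 = not taken.
    divider: maps (index, stream) to a hashable substream key.
    make_predictor: factory for objects with predict() and update(direction).
    """
    predictors = defaultdict(make_predictor)  # one predictor per substream
    correct = 0
    for i, (_, direction) in enumerate(stream):
        predictor = predictors[divider(i, stream)]
        if predictor.predict() == direction:
            correct += 1
        predictor.update(direction)
    # Accuracy = correct predictions / total predictions.
    return correct / len(stream)

class AlwaysTaken:
    """Static predictor that predicts every branch taken."""
    def predict(self):
        return 1
    def update(self, direction):
        pass  # a static predictor never changes its prediction

def identity_divider(i, stream):
    return 0  # every execution lands in the same single substream

# Hypothetical trace: three taken executions and one not taken.
stream = [("b5", 1), ("b3", 1), ("b4", 0), ("b5", 1)]
accuracy = simulate(stream, identity_divider, AlwaysTaken)
print(accuracy)  # 0.75
```

Swapping in a different `divider` or `make_predictor` changes the scheme without touching the accuracy computation, which is the separation the framework is meant to expose.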

The most popular divider function in today's microprocessors par- 
titions the program execution stream based on the static branch 
identifier. This partitioning ideally forms one substream for each 
static branch in the program (a per-branch substream) as shown in 
Figure 2. Formally, if there are n static branches in the program, 
then the divider creates n substreams, one for each static branch 
identifier. The divider assigns the $i$th execution $e_i = (b_i, d_i)$ to
the substream that corresponds to $b_i$. The intuition behind this
divider is that each branch should have its own predictor because 
the characteristics and past history of this branch are a good pre- 
dictor of its future behavior. Both the per-branch 2-bit counter 
scheme 3 [7] and per-branch profile-based prediction scheme [10] 
partition the program execution stream in this manner.

Figure 2. Subdividing the program execution stream into
per-branch substreams. 
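
The per-branch divider can be sketched as a grouping of directions by static branch identifier. The helper names and the six-execution trace below are hypothetical, and the static accuracy shown assumes training and testing on the same trace.

```python
from collections import defaultdict

def per_branch_substreams(stream):
    """Group directions by static branch identifier (one substream per branch)."""
    subs = defaultdict(list)
    for branch_id, direction in stream:
        subs[branch_id].append(direction)
    return dict(subs)

def profiled_static_accuracy(subs):
    """Accuracy when each substream is statically predicted in its majority direction."""
    correct = sum(max(sum(ds), len(ds) - sum(ds)) for ds in subs.values())
    total = sum(len(ds) for ds in subs.values())
    return correct / total

# Hypothetical trace of (branch_id, direction) pairs; 1 = taken.
stream = [("b1", 1), ("b2", 0), ("b1", 1), ("b3", 1), ("b2", 0), ("b1", 0)]
subs = per_branch_substreams(stream)
print(subs)                            # {'b1': [1, 1, 0], 'b2': [0, 0], 'b3': [1]}
print(profiled_static_accuracy(subs))  # 5 correct out of 6
```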

More recent branch prediction schemes further subdivide the per- 
branch streams. The intuition behind these schemes is that finer 
decomposition of a per-branch stream can increase the predictabil- 
ity of the individual substreams. For instance, Pan, So, and Rah- 
meh [11] describe a scheme (which Yeh and Patt call GAs [14]) 
that partitions each per-branch stream based on the pattern of 
directions of the k preceding branch executions in the program 
execution stream, as illustrated in Figure 3. The intuition here is 
that sections of code deal with related information, so tests of one 
condition are likely to be placed near tests of related conditions.
Formally, consider the $i$th execution in the program execution
stream, $e_i = (b_i, d_i)$. The GAs scheme considers not just $b_i$, but
also the directions of the $k$ preceding executions
$d_{i-1}, d_{i-2}, \ldots, d_{i-k}$. These $k$ bits are called the pattern history of
preceding branch executions. The $k$ pattern bits are used to further
divide a per-branch stream (based on $b_i$) into $2^k$ substreams. We
refer to these substreams as per-branch global-pattern streams.

3 In this subsection, we ignore implementation issues that keep us from obtaining a
hardware predictor per static branch. These issues are addressed in Section 2.4 and
Section 4. Here, we are concerned only with the ideal intent of the branch prediction
scheme.

Figure 3. Subdivision in the GAs scheme. In addition to branch
identifier, the pattern of k preceding branches in the program 
execution stream is used to further divide the branch streams, so 
there is one stream per pattern per branch. 
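
A small sketch of the ideal per-branch global-pattern divider follows. It assumes an unbounded number of substreams (no hardware table), and the function name and the two-branch trace, in which X always goes the same way as the preceding Y, are hypothetical.

```python
from collections import defaultdict

def gas_substreams(stream, k):
    """Split a trace by (branch id, directions of the k preceding executions)."""
    subs = defaultdict(list)
    for i, (branch_id, direction) in enumerate(stream):
        # Pattern history d_{i-1}, ..., d_{i-k}; shorter at the start of the trace.
        history = tuple(d for _, d in reversed(stream[max(0, i - k):i]))
        subs[(branch_id, history)].append(direction)
    return dict(subs)

# Hypothetical trace: branch X is taken exactly when the preceding Y was taken.
stream = [("Y", 1), ("X", 1), ("Y", 0), ("X", 0)] * 2
subs = gas_substreams(stream, k=1)
print(subs[("X", (1,))])  # [1, 1]
print(subs[("X", (0,))])  # [0, 0]
```

The per-branch stream for X alone is evenly split between the two directions, while each per-branch global-pattern stream for X contains a single direction.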



As another way to subdivide per-branch streams, Yeh and Patt 
describe a scheme called PAs [15] that uses the last k branches in a 
per-branch stream to further partition that per-branch stream. This 
leads to a different set of substreams from the GAs scheme. Formally,
consider the $i$th branch execution in the program execution
stream, $e_i = (b_i, d_i)$, which is an execution of branch $b_i$. Let
$l_1, l_2, \ldots, l_k$ be the indices of the $k$ previous executions of branch
$b_i$. The PAs scheme uses the pattern $d_{l_1}, d_{l_2}, \ldots, d_{l_k}$ rather than the
pattern $d_{i-1}, d_{i-2}, \ldots, d_{i-k}$ to subdivide the per-branch stream
(based on $b_i$) into $2^k$ substreams. Since the former pattern is determined
only by executions of one branch, $b_i$, PAs does not exploit
any inter-branch correlation; instead it is designed to exploit
repeating patterns in the execution of a single branch. For example,
on a loop branch that iterates a constant $c < k$ times, PAs
approaches 100% branch prediction accuracy, because it will gen- 
erate substreams consisting solely of a single branch direction (i.e. 
it can recognize the pattern of c taken branches that will be fol- 
lowed by a fall-through branch). We refer to these substreams as 
per-branch branch-pattern streams. 
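
The loop-branch example can be checked with a short sketch of an ideal PAs-style divider. The function name and the trace (a loop branch taken c = 3 times and then falling through, with k = 4) are assumptions for illustration.

```python
from collections import defaultdict

def pas_substreams(stream, k):
    """Split a trace by (branch id, last k directions of that same branch)."""
    own_history = defaultdict(tuple)  # branch id -> its own recent directions
    subs = defaultdict(list)
    for branch_id, direction in stream:
        subs[(branch_id, own_history[branch_id])].append(direction)
        # Keep the most recent direction first and truncate to k entries.
        own_history[branch_id] = ((direction,) + own_history[branch_id])[:k]
    return dict(subs)

c, k = 3, 4
loop_visit = [("L", 1)] * c + [("L", 0)]  # c taken branches, then one fall-through
stream = loop_visit * 50
subs = pas_substreams(stream, k)

# Ignore the warm-up substreams whose history is still shorter than k.
full = {key: ds for key, ds in subs.items() if len(key[1]) == k}
print(len(full))                                         # 4
print(all(len(set(ds)) == 1 for ds in full.values()))    # True
```

Once the history is full, every substream holds a single direction, so any predictor that learns that direction is correct on all later executions.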

As a last example of how to subdivide per-branch substreams, we
consider our scheme for static correlated branch prediction (scbp)
[17]. This scheme divides both by branch and by the path of
branches that led to the executed branch. A path differs from a pattern
because it includes both the branch identifiers and the executed
directions, not just the concatenation of direction bits. So our
static correlated scheme uses the vector

$\langle (b_{i-1}, d_{i-1}), (b_{i-2}, d_{i-2}), \ldots, (b_{i-k}, d_{i-k}) \rangle$

to encode the path by which $b_i$ was reached, and it uses this vector
to subdivide the per-branch stream (based on $b_i$) into
$(2 \times \text{number of static branches})^k$ substreams. We refer to these
substreams as per-branch global-path streams. 
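
To compare the dividers reviewed in this subsection, the following tally gives the maximum number of substreams each ideal divider can create. It assumes $n$ static branches, a history depth of $k$, and a path made of $k$ (branch, direction) pairs; the symbol $n$ and the numeric example are ours.

```latex
\begin{align*}
\text{identity divider}              &: 1 \\
\text{per-branch}                    &: n \\
\text{per-branch global-pattern (GAs)} &: n \cdot 2^{k} \\
\text{per-branch branch-pattern (PAs)} &: n \cdot 2^{k} \\
\text{per-branch global-path (scbp)}   &: n \cdot (2n)^{k}
\end{align*}
% Example with n = 1000 and k = 12:
%   n * 2^k    = 1000 * 4096                   = 4.096 * 10^6
%   n * (2n)^k = 1000 * 2000^12 = 1000 * 4.096 * 10^39 = 4.096 * 10^42
```

These are upper bounds only; the number of substreams that actually occur is limited by the length of the trace and the structure of the program.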

2.3 Predictors and Streams 

Under our framework, the divider presents each substream to a sin- 
gle predictor. Each predictor considers some combination of the 
program characteristics, the past branch execution stream, and its 
own internal state (if any) in making a branch prediction. In this 
subsection, we review the range of existing predictors, and we dis- 
cuss the characteristics of streams that make them predictable. 



Predictors can be classified into two major types: static predictors 
and dynamic predictors. A static predictor must fix its prediction 
before the program runs, while a dynamic predictor is allowed to 
change its prediction during program execution. Streams that are 
largely invariant in branch direction can be accurately predicted by 
a static predictor. We say that a stream is strongly biased if the fre- 
quency of one direction is much greater than the frequency of the 
other direction, and that it is weakly biased if the frequencies are 
close to equal. We refer to the more prevalent direction of the 
stream as the majority direction; the other direction is conversely 
the minority. 

Researchers have investigated a variety of static program and 
branch characteristics to help determine the appropriate static pre- 
diction for an execution stream. For example, the simple static 
branch prediction scheme that always predicts branches taken
[12] uses the statistical fact that branches tend to be taken more often
than they fall through. The "backwards taken forwards not taken"
(BTFNT) scheme [12] bases the static prediction on the sign of a
branch's target offset. Other schemes employ a predictor that computes
predictions as a function of the opcode of the branch [7].
Finally, methods like those described by Ball and Larus [2] use
sophisticated heuristics about the program structure to generate a 
static prediction for each branch. 

Other than the static characteristics of the program and the 
branches in the program, researchers use a profile of the dynamic 
behavior of the program branches, gathered during an earlier pro- 
gram run, to set the static prediction of each branch. If the majority 
direction remains the same from the profile (training) to the testing 
run, then a profiled static predictor will perform well. To date, 
researchers have used only the overall bias of the past branch exe- 
cution to set the static prediction. In our earlier paper [17], we used 
other characteristics of the past execution stream, but we used this 
information to reorganize the program so that its individual branch 
streams are more strongly biased. 

In contrast, dynamic predictors can adapt to track the bias of a 
stream during a single execution of a program. This has the added 
benefit of not requiring any training or profiling before the pro- 
gram run. Surprisingly, there are very few designs for dynamic 
predictors. By far, the most popular dynamic predictor is the 2-bit 
saturating, up/down counter [12]. This predictor forms the basis of 
all of the correlated branch predictors described by McFarling [9], 
Pan et al. [11], and Yeh and Patt [14, 15, 16].

Lee and Smith [7] observed that the execution streams of most 
program branches tend to occur in long runs 4 and that an n-bit 
counter predictor can exploit this regularity. Smith [12] further 
observed that a 2-bit counter empirically provides an appropriate 
amount of damping (or hysteresis) to changes in stream direction. 
A 1-bit counter has no damping (it simply records the direction of 
the last branch), and 3-bit and higher counters do not appear to
offer large cost/benefit advantages over 2-bit counters [12]. Damping
trades off adaptability for vulnerability to short minority runs.
A 2-bit counter is excellent at predicting streams with long minority
runs, and it is damped enough to ignore minority runs of length 1.
This allows loop branches, for instance, to incur just one mispredict
per loop, instead of two mispredicts (one on loop exit and one
on loop reentry). 
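
The loop-branch behavior described above can be reproduced with a minimal saturating-counter sketch. The class name, the initial "strongly taken" state, and the trace (a loop branch taken nine times and then falling through, visited 100 times) are assumptions for illustration.

```python
class SaturatingCounter:
    """n-bit saturating up/down counter; predicts taken in the upper half of its range."""
    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.state = self.max  # assumed initial state: strongly taken

    def predict(self):
        return 1 if self.state >= self.threshold else 0

    def update(self, direction):
        if direction:
            self.state = min(self.max, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def mispredicts(counter, directions):
    """Count mispredictions when the counter predicts, then trains, on each direction."""
    misses = 0
    for d in directions:
        if counter.predict() != d:
            misses += 1
        counter.update(d)
    return misses

loop = ([1] * 9 + [0]) * 100  # 9 taken iterations, then loop exit, 100 visits
two_bit = mispredicts(SaturatingCounter(bits=2), loop)
one_bit = mispredicts(SaturatingCounter(bits=1), loop)
print(two_bit)  # 100: one mispredict per loop visit (the exit)
print(one_bit)  # 199: exit and reentry mispredict, except reentry on the first visit
```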

4 A run is a substring of the stream that consists entirely of one direction, and is
bounded on either side by executions that go in the opposite direction (or the
beginning or end of the stream). Note that a proper substring of a run is not itself a
run.

The distribution of minority run lengths in a stream strongly relates
to the effectiveness of today's dynamic predictors. Streams with
long runs of one direction followed by long runs of the other direction
can be accurately predicted by a dynamic predictor but not by
a static predictor. However, a large distribution of short minority 
runs can cause a dynamic predictor to exhibit worse accuracy than 
a static predictor because the dynamic predictor adapts too slowly 
to the changes in the runs. 
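
A short sketch makes the effect of run lengths concrete. The two streams below are hypothetical, the static predictor is assumed to be profiled on the same stream, and the 2-bit counter is assumed to start in the strongly taken state.

```python
from itertools import groupby

def runs(directions):
    """Maximal runs of a stream as (direction, length) pairs."""
    return [(d, len(list(group))) for d, group in groupby(directions)]

def static_misses(directions):
    """Mispredicts of a static predictor fixed to the majority direction."""
    taken = sum(directions)
    return min(taken, len(directions) - taken)

def two_bit_misses(directions, state=3):
    """Mispredicts of a 2-bit saturating counter (states 0..3, taken if state >= 2)."""
    misses = 0
    for d in directions:
        if (state >= 2) != bool(d):
            misses += 1
        state = min(3, state + 1) if d else max(0, state - 1)
    return misses

long_runs = [1] * 100 + [0] * 100   # one long run in each direction
short_runs = [1, 1, 1, 0, 0] * 40   # many minority runs of length 2

print(runs(short_runs[:5]))                                      # [(1, 3), (0, 2)]
print(static_misses(long_runs), two_bit_misses(long_runs))       # 100 2
print(static_misses(short_runs), two_bit_misses(short_runs))     # 80 119
```

On the first stream the counter adapts after two mispredicts, while the static predictor misses an entire run. On the second, the counter is pulled toward the minority direction just as each majority run resumes and ends up worse than the static predictor.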

One other interesting property is the frequency of recurrent patterns
in a stream. A pattern is a non-empty string $w \in \{0, 1\}^*$. A
recurrent pattern is a substring that occurs multiple times in a 
stream. Unlike bias and distribution of runs, which are typically 
used to predict streams that have been divided, this property is 
exploited by some dividers (e.g. the PAs scheme [15]). 

2.4 Implementation Details 

To this point, our explanations of existing branch prediction 
schemes focused on the ideal implementation of a scheme. For 
example, the explanation above describes a per-branch dynamic 
prediction scheme based on 2-bit counters as able to assign each 
per-branch stream to a unique 2-bit counter. In actual implementa- 
tions of per-branch 2-bit counter schemes, this is believed to be 
impractical. Implementors usually solve this by using just the $j$
least significant bits of the branch address as an index into a table
of $2^j$ counters. This means that, if two conditional branches have
the same j lowest bits, their branch streams will be intermingled 
and sent to a single 2-bit predictor. We call this effect aliasing, as 
the original intent of the 2-bit counter scheme was to provide a sin- 
gle predictor per static branch. 

Issues in aliasing have led researchers to develop different branch 
prediction schemes that we would classify as based on the same 
ideal branch prediction model. For instance, the GAs scheme [14]
and McFarling's gshare scheme [9] both ideally divide the program
execution stream into per-branch global-history substreams,
and both use a 2-bit counter as the base predictor. The gshare
scheme requires fewer 2-bit counters for fixed values of $j$ and $k$
because it exclusive-ors, rather than concatenates, the $k$ bits of pattern
history with the $j$ bits of branch address when indexing into
the limited table of 2-bit counters. This gives a requirement of
$2^{\max(k, j)}$ counters instead of $2^{k+j}$ counters. Section 4.2 shows
that aliasing potentially limits the effectiveness of the ideal divider 
by intermingling streams that we would ideally like separated. 
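
The difference between the two indexing functions can be sketched as follows. This is an illustration under simplifying assumptions: addresses and histories are plain integers, the low-order bits are used directly (real hardware would typically skip alignment bits), and the function names are ours.

```python
def gas_index(address, history, j, k):
    """Concatenate j address bits with k history bits (table of 2**(j + k) counters)."""
    return ((address & ((1 << j) - 1)) << k) | (history & ((1 << k) - 1))

def gshare_index(address, history, j, k):
    """Exclusive-or j address bits with k history bits (table of 2**max(j, k) counters)."""
    return (address & ((1 << j) - 1)) ^ (history & ((1 << k) - 1))

def table_size(scheme, j, k):
    """Number of counters the indexing function can address."""
    return 2 ** (j + k) if scheme == "gas" else 2 ** max(j, k)

print(table_size("gas", 6, 6))         # 4096
print(table_size("gshare", 12, 12))    # 4096
print(gas_index(45, 3, 6, 6))          # 2883
print(gshare_index(45, 3, 6, 6))       # 46
print(gshare_index(46, 0, 6, 6))       # 46: a different (address, history) pair, same counter
```

The last two lines show how exclusive-or indexing lets two distinct (address, history) pairs share one counter even when both fit within the index width.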

Static branch prediction schemes that can fix a prediction to each 
static branch in the program obviously do not suffer from these 
effects of aliasing. However, static schemes have their own potential
limitations due to implementation details. For example, the
implementation of our algorithm for static correlated branch prediction
[17] does not distinguish between paths that cross a procedure
call or return boundary. In other words, it effectively
truncates the vector that is used to divide the stream in the cases
where a path crosses a procedure boundary. This truncation merges 
streams that would be separated by a more sophisticated divider. 
We distinguish aliasing from merging: aliasing combines streams 
from different static branches, while merging combines streams 
from one static branch. 

2.5 Hybrid Approaches 

Recent work in branch prediction by McFarling [9] and Chang [4] 
has proposed hybrid branch prediction schemes which group 
together multiple basic prediction schemes. The hybrid schemes, 
either statically or dynamically, select the basic prediction scheme 
that performs best on a stream. The model in this section can easily 
be extended to cover hybrid schemes; however, this paper focuses 
on the power in our model to analyze and improve the individual
prediction schemes. Benefits to basic schemes will of course
improve the hybrid schemes that include them. 

The framework illustrates two distinct avenues of research for 
improving the accuracy of a branch prediction scheme: one could 
attempt to improve the sophistication of the ideal model; or one 
could attempt to remove limitations imposed by current implemen- 
tation details. The next two sections give examples of each of these 
approaches, for both static and dynamic branch prediction 
schemes. 

3 Why Static Correlated Prediction Works 

The framework described in the previous section gives us a set of 
terms that can be used to describe, compare, and contrast the 
behavior of branch prediction schemes. In this section, we exam- 
ine a simple application of this framework to a pair of similar pre- 
diction schemes: per-branch static profile prediction and our static 
correlated profile prediction [17]. Per-branch static profiling has 
been shown to work well in a number of studies [5, 10]. In this sec- 
tion, we show how our code transformation exploits branch corre- 
lation to increase branch bias. 

As noted in Section 2, bias is key to static branch prediction. Fig- 
ure 4 plots the distribution of taken branch frequency averaged 
over all benchmarks and data sets. Table 1 presents a summary of 
our benchmarks and experimental methodology. The "Self History"
bars in Figure 4 show that, even for executables produced by
today's compilers, most of the dynamic branches are strongly 
biased. This U-shaped distribution is what makes per-branch static 
branch prediction effective.

Figure 4. Histogram of branch bias, weighted by execution frequency.
This plot averages over all benchmarks, giving equal weight to each data 
set run. The "Self-History" bars indicate the branch bias in the original 
executables. The "Path-History" bars indicate the branch bias of executa- 
bles after transformation to exploit branch correlation with a history 
depth of 12. The bias values represent the midpoint of a range, e.g. the 
"10%" bars capture bias values between 5% and 15%. Although this 
graph averages over all benchmarks and data sets, the trend of increased 
bias occurred in each individual run. These results train and test on the 
same dataset. 
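
A histogram of this kind can be computed from a set of substreams as sketched below. The bucketing follows the caption (each bucket is the midpoint of a 10-point range, so the 10% bucket covers 5% up to 15%); the function name and the three example substreams are hypothetical.

```python
from collections import Counter

def bias_histogram(substreams):
    """Map bucket midpoint (0, 10, ..., 100 percent taken) to fraction of dynamic branches."""
    weights = Counter()
    total = 0
    for directions in substreams.values():
        taken_pct = 100 * sum(directions) / len(directions)
        bucket = 10 * int((taken_pct + 5) // 10)  # e.g. 5 <= pct < 15 maps to 10
        weights[bucket] += len(directions)        # weight by execution frequency
        total += len(directions)
    return {bucket: weight / total for bucket, weight in sorted(weights.items())}

# Hypothetical substreams: 98% taken, 0% taken, and 50% taken.
substreams = {
    "b1": [1] * 98 + [0] * 2,
    "b2": [0] * 50,
    "b3": [1, 0] * 25,
}
histogram = bias_histogram(substreams)
print(histogram)  # {0: 0.25, 50: 0.25, 100: 0.5}
```

A U-shaped result, with most of the weight in the 0 and 100 buckets, corresponds to strongly biased streams.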



The effect of exploiting branch correlation is to divide each per- 
branch stream into several separate streams, discriminating by cor- 
relation paths in addition to the static branch identifier. The "Path 
History" bars in Figure 4 show the distribution of taken branch fre- 
quency after our transformation to exploit branch correlation [17]. 
Compared to the "Self History" bars, the "Path History" bars
exhibit a larger percentage of strongly biased branches. Over 70%
of dynamic branches now occur in streams that are highly predictable.
In other words, the more finely subdivided per-branch global-path
substreams are more predictable than the coarsely divided
per-branch substreams. As we show further in Section 4.3, correlation
shifts the distribution of streams and their dynamic branches
toward stronger bias.

Benchmark                                          Total     Static
  Data set Description                             branches  branches
                                                   executed  touched

awk [awk]: pattern-directed scanning/processing, GNU ver. 2.15
  a        extensive test of awk's capabilities    2.54M     1393
  b        simple scanning and printing            0.62M     835
  c        generate max array of 3 arrays          4.99M     968

compress [comp]: compression using adaptive Lempel-Ziv, SPECint92
  in       SPECint92 reference input               11.4M     277
  'm       jargon dictionary (1MB of ASCII)        13.1M     280
  ps       15-page postscript paper                2.0M      268

diff [diff]: differential file comparator, GNU version 2.6
  a        two C files with 3 diffs                0.43M     646
  b        two latex files with many diffs         0.27M     704
  xsim     xsim sources with many diffs            0.72M     711

eqntott [eqn]: boolean equation to truth table conversion, SPECint92
  fx2fp    8-bit fix to floating point encoder     29.4M     533
  tbra     MIPS R2000 taken branch decode          19.3M     528

espresso [esp]: boolean minimization, SPECint92
  bca      SPECint92 reference input               73.9M     1722
  cps      SPECint92 reference input               83.1M     1845
  ti       SPECint92 reference input               87.4M     1899

grep [grep]: pattern searching program, GNU version 2.0
  a        search for a constant string (2 hits)   0.07M     611
  khad     complex regular exp. (100% hits)        OHM       966
  re3      search for a regular exp. (21 hits)     0.33M     878

sc [sc]: spreadsheet program, SPECint92
  11       SPECint92 short input                   23.5M     1614
  lbl      SPECint92 reference input               179.3M    1642
  lb3      SPECint92 reference input               44.4M     1538

xlisp [li]: lisp interpreter, SPECint92
  newt     square root via Newton's method         OHM       550
  q4       4 queens problem                        0.41M     605
  q7       7 queens problem                        32.4M     605

Table 1: Benchmark and data set descriptions. The results in this paper
were derived from trace-driven simulations. We collected the traces
using ATOM v1.1 [13]. We compiled the SPECint92 benchmarks
using cc version 2.0.0 and the optimization level specified in the SPEC
makefiles. The additional benchmarks were compiled using gcc v2.6.0
(-O3). All of the experiments were performed on a DEC 3000/400
running OSF/1 version 2.0.

For static profile prediction to be practical, the static predictions 
chosen must be valid across invocations of the program. If the 
majority direction of a stream differs between the profiled (train- 
ing) data set and the running (testing) data set, then a static predic- 
tor will suffer. Fisher and Freudenberger [5] examined a number of 
different benchmarks and data sets under static profile prediction, 
and determined that good prediction could be achieved even while 
training and testing on different data sets. Our experiences so far 



with various static correlated branch prediction schemes show sim- 
ilar results, although we have not yet done a comprehensive study. 
An exhaustive treatment of data variance is outside the scope of 
this paper. 

4 Comparing Correlated Schemes

This section uses the framework to tackle the much harder problem 
of comparing static and dynamic correlated branch prediction 
schemes. Superficially, one can compare the prediction accuracy 
reported by the designers of static and dynamic correlated 
schemes, but this numerical comparison is unenlightening. For 
example, in an earlier paper [17], we found that our static corre- 
lated branch prediction scheme did not achieve as high a predic- 
tion accuracy as the published dynamic correlated schemes. We 
cannot conclude from these results, however, that the dynamic 
schemes are necessarily better than static schemes since these 
schemes differ in more than their base predictors. 

Aside from the fundamental differences between a static and a 
dynamic predictor, our framework suggests that there are three 
major implementation differences in the divider function: the use 
of path versus pattern history, the aliasing of multiple (possibly 
unrelated) branches to the same predictor, and the lack of correla- 
tion information across procedure call boundaries. Path history is 
used in our software prediction scheme, while all current corre- 
lated branch prediction schemes based on a hardware table of pre- 
dictors use pattern history. The aliasing of per-branch streams 
occurs in hardware-based branch prediction schemes but not in the 
profiled branch prediction schemes. Finally, our software scheme 
for correlated branch prediction, unlike the hardware-based 
schemes, does not exploit correlation across procedure call and 
return boundaries. Each of these implementation differences can 
be seen as a limitation that keeps the implemented divider from 
behaving as precisely as an ideal mathematical divider. Sections 
4.1 through 4.3 show, by focusing on gshare [9], GAs [14], and our
static correlated branch prediction scheme (scbp) [17], that the
removal of these implementation differences can improve the pre- 
diction accuracy of correlated branch prediction schemes. 

Once we have equalized the divider function, an interesting ques- 
tion to ask is how much benefit one gets from the use of a dynamic 
predictor in a correlated scheme. Section 4.4 presents one answer 
to this question by comparing the prediction accuracy of a corre- 
lated branch prediction scheme that uses either static predictors or 
2-bit dynamic predictors. This experiment uses a theoretical 
divider function that is uninhibited by the implementation effects 
of Sections 4.1 through 4.3. Through this experiment, we can 
begin to understand the true need for dynamic predictors. By 
understanding where a dynamic predictor is beneficial, we expect 
to understand how to develop new code transformations to 
improve the static prediction schemes. 

4.1 Paths versus Patterns

As explained in Section 2.2, an implementation difference exists 
between the divider used in our scbp scheme and the one used in
an ideal GAs scheme. Our scbp scheme is based on a per-branch
global-path divider that uses a history consisting of a path vector 
(path history), while GAs is based on a per-branch global-pattern 
divider that uses a history consisting of the pattern of the directions 
of the most-recent branch executions (pattern history). Path history 
should provide better correlation information than pattern history, 
because path history is a superset of pattern history. Path informa- 
tion includes the branches by which the current branch was 
reached, not just the pattern of directions that they went to reach
the current branch. For example, path information can differentiate
between two streams in which two different branches are each taken to
reach a common third block. Figure 5 shows a simple example where path
history achieves a better prediction accuracy than pattern history.



[Figure 5: control-flow example. Block A tests "if aa==0" and block B
tests "if aa==2". Paths AMY and BMY both reach branch Y with pattern
history = "tt"; the path outcome for Y is 0% taken on one path and
100% taken on the other.]



Figure 5. Example illustrating the benefit of path history over 
pattern history. Path AMY and BMY are indistinguishable using 
pattern history, but distinguishable using path history. 
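
The situation in Figure 5 can be imitated with a small sketch. The trace is hypothetical: it assumes Y is never taken when reached through A and always taken when reached through B, and it measures a profiled static predictor trained and tested on the same trace.

```python
from collections import defaultdict

def split(stream, k, use_path):
    """Divide a trace by branch id plus either path history or pattern history."""
    subs = defaultdict(list)
    for i, (branch_id, direction) in enumerate(stream):
        prev = stream[max(0, i - k):i]
        # Path history keeps (branch, direction) pairs; pattern history keeps directions only.
        history = tuple(prev) if use_path else tuple(d for _, d in prev)
        subs[(branch_id, history)].append(direction)
    return subs

def static_misses(subs, branch_id):
    """Mispredicts for one branch when each substream is predicted in its majority direction."""
    return sum(min(sum(ds), len(ds) - sum(ds))
               for (b, _), ds in subs.items() if b == branch_id)

via_a = [("A", 1), ("M", 1), ("Y", 0)]  # assumed: Y not taken after path A, M
via_b = [("B", 1), ("M", 1), ("Y", 1)]  # assumed: Y taken after path B, M
stream = (via_a + via_b) * 10

pattern_misses = static_misses(split(stream, 2, use_path=False), "Y")
path_misses = static_misses(split(stream, 2, use_path=True), "Y")
print(pattern_misses)  # 10: both paths collapse into the single pattern "tt"
print(path_misses)     # 0: the two paths form separate, fully biased streams
```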

To quantify the benefits of path information, we simulated a scbp 
scheme that used only pattern information, and we compared the 
prediction accuracy of this scheme against the prediction accuracy 
of the per-branch profile scheme and a scbp scheme using path his- 
tory. These results are summarized in Figure 6. For all bench- 
marks, the pattern-based scbp scheme shows significant 
improvements over per-branch static profile prediction. There is a 
small improvement in mispredict rate when path history is used, 
ranging from negligible in diff.a to removing 14% of mispredicted 
branches in espresso.bca. There is a measurable benefit to exploit- 
ing path history instead of pattern history, but the majority of 
advantage is gained just from pattern information. 




Figure 6. Mispredict rates of per-branch static branch prediction, static 
correlated branch prediction using pattern history, and static correlated 
branch prediction using path history. Both correlated schemes use a history
depth of $k = 12$. All schemes train and test on the same data set.

From this result in static branch prediction, we would like to gen- 
eralize to say that all branch prediction schemes could be 
improved by incorporating path history. But to do this, we need to 
isolate out the other factors that affect prediction accuracy so that 
these other factors do not overwhelm the gains due to using path 
instead of pattern history (and thus cloud the results). We will 
return to this issue at the end of the next section. 



4.2 Aliasing 

One important factor that differentiates a static profiled branch 
prediction scheme from a dynamic branch prediction scheme 
stems from the fact that dynamic branch prediction schemes map 
unevenly distributed information like branch address and pattern 
history into indexed, regular hardware structures (i.e. a table of 
predictors). In Section 2.4, we defined the term aliasing to describe 
the situation when, due to implementation limitations, a divider 
forces streams from different branches to map to the same predic- 
tor. A static profiled branch prediction scheme does not suffer from 
aliasing effects since each branch encodes its predictor function. 

Aliasing does not directly imply penalties to prediction accuracy. If 
two branches with different majority direction alias to the same 
counter, but one executes 1,000 times followed by the other exe- 
cuting 1,000 times, the loss due to aliasing is negligible. However, 
if the two branches alternate in trace order, then aliasing may cause 
significant misprediction. To relate aliasing back to prediction 
accuracy, we define three kinds of aliasing: If an aliased counter 
predicts an execution correctly while the corresponding per-stream 
counter predicts it incorrectly, we call that execution an instance of 
constructive aliasing since the aliasing improves prediction accu- 
racy. Conversely, if the aliased counter mispredicts while a per- 
stream counter would have predicted correctly, we call that an 
instance of destructive aliasing since the aliasing reduces predic- 
tion accuracy. Finally, if the aliased counter predicts an execution 
correctly (incorrectly) and the corresponding per-stream counter 
predicts correctly (incorrectly) too, we call that execution an 
instance of harmless aliasing since the aliasing does not change 
prediction accuracy. 
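
The three kinds of aliasing can be counted by running a shared counter and a per-stream counter side by side, as in the following sketch. The function names, the initial strongly taken state, and the two traces (modeled on the 1,000-execution example above) are assumptions for illustration.

```python
def step(state, direction):
    """Advance a 2-bit saturating counter (states 0..3) by one outcome."""
    return min(3, state + 1) if direction else max(0, state - 1)

def classify_aliasing(stream, index_of, init=3):
    """Count constructive, destructive, and harmless executions.

    stream: (stream_key, direction) pairs.
    index_of: maps a stream key to the index of the shared (aliased) counter.
    """
    shared, private = {}, {}
    counts = {"constructive": 0, "destructive": 0, "harmless": 0}
    for key, direction in stream:
        idx = index_of(key)
        s = shared.get(idx, init)    # aliased counter
        p = private.get(key, init)   # ideal per-stream counter
        shared_ok = (s >= 2) == bool(direction)
        private_ok = (p >= 2) == bool(direction)
        if shared_ok and not private_ok:
            counts["constructive"] += 1
        elif private_ok and not shared_ok:
            counts["destructive"] += 1
        else:
            counts["harmless"] += 1
        shared[idx] = step(s, direction)
        private[key] = step(p, direction)
    return counts

same_counter = lambda key: 0  # force both branches onto one counter
sequential = [("X", 1)] * 1000 + [("Z", 0)] * 1000
alternating = [("X", 1), ("Z", 0)] * 1000

print(classify_aliasing(sequential, same_counter))
# {'constructive': 0, 'destructive': 0, 'harmless': 2000}
print(classify_aliasing(alternating, same_counter))
# {'constructive': 0, 'destructive': 998, 'harmless': 1002}
```

With these initial states the sequential trace loses nothing to aliasing, while in the alternating trace almost every execution of the not-taken branch becomes an instance of destructive aliasing.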

Our intuition is that aliasing is generally bad for prediction accu- 
racy. Since a branch prediction table is a kind of cache, aliasing is 
analogous to conflict misses in a cache. Instead of suffering con- 
flict misses though, aliased predictors suffer from muddled predic- 
tions. As in a cache, increasing the size of the prediction table can 
help to reduce conflicts (and increase prediction accuracy), as is 
shown in most of the dynamic branch prediction literature [10, 11, 
14, 15, 16]. Chang et al. [4] show benefits to separating out
strongly biased branches from weakly biased branches, noting that 
using static prediction on the strongly biased branches reduces 
contention (aliasing) in the table of 2-bit counters. Unlike cache 
conflict misses though, aliasing can be constructive or harmless in 
addition to destructive. It is important then to understand how 
aliasing affects the design space of dynamic branch prediction 
schemes. In particular, we investigate the question of how often 
aliasing happens in dynamic correlated branch prediction schemes 
and how this aliasing affects the prediction accuracy. 

To see how common aliasing is, we instrumented our hardware 
simulations to count the number of static branches that map to 
each 2-bit counter. We examined the usage patterns of the per- 
branch 2-bit counter scheme, the GAs scheme, and the gshare 
scheme for a fixed size table of 4096 2-bit counters. As an example 
of the type of results we saw, Figure 7 plots the distribution of the 
number of branches that alias to each counter for each scheme 
under the awk.a benchmark. From Figure 7, one can see that alias- 
ing happens infrequently in the standard 2-bit counter scheme. 
This makes sense since all benchmarks touch noticeably fewer 
than 4096 static branches (see Table 1). Aliasing increases in GAs, 
and reaches very high levels under gshare since these schemes 
produce significantly more than 4096 substreams from their dividers.
Under gshare, running the espresso and sc benchmarks, aliasing
happens so often that in some cases no counters have fewer
than three different branches aliased to them. Figure 8 shows detail
on constructive versus destructive aliasing in awk.a under gshare;
constructive aliasing is both rare and much smaller in magnitude
than destructive aliasing.

[Figure 7 panels, left to right: Per-branch 2bc, GAs, gshare]

Figure 7. Aliasing in a 4096 counter table for awk.a under the per-branch
2-bit counter (12 bits of branch address), GAs (6 bits of branch address
concatenated with 6 bits of branch history), and gshare (12 bits of branch
address exclusive-ored with 12 bits of branch history) schemes. White
squares represent unused counters; black squares represent counters with
seven or more aliased streams.

[Figure 8 panels, left to right: Constructive Aliasing, Destructive Aliasing]

Figure 8. Constructive and destructive aliasing in awk.a under the gshare
scheme. In the left graph, gray scale indicates constructive aliasing, with
black representing the maximum attained value of 32. In the right graph,
gray scale indicates destructive aliasing, with black representing a
difference of 32 or more incorrect predictions due to destructive aliasing.
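
The instrumentation that counts static branches per counter can be sketched as below. This is not the simulator used for the figures; the function name, the four-entry table, the tiny trace, and the use of raw low-order address bits are assumptions for illustration.

```python
from collections import defaultdict

def branches_per_counter(trace, index_of):
    """Return {counter index: number of distinct static branches that map to it}.

    trace: (address, direction) pairs in execution order.
    index_of: maps (address, global history) to a counter index.
    """
    users = defaultdict(set)
    history = 0  # global pattern history, most recent direction in the low bit
    for address, direction in trace:
        users[index_of(address, history)].add(address)
        history = (history << 1) | direction
    return {idx: len(addresses) for idx, addresses in users.items()}

# Hypothetical trace over three static branches and a table of 4 counters.
trace = [(4, 1), (5, 0), (8, 1), (4, 1)]
per_branch = branches_per_counter(trace, lambda a, h: a & 3)
gshare = branches_per_counter(trace, lambda a, h: (a ^ h) & 3)
print(per_branch)  # {0: 2, 1: 1}
print(gshare)      # {0: 2, 2: 1, 1: 1}
```

Plotting these counts for every table entry gives a picture like Figure 7: unused counters never appear in the result, and heavily aliased counters have large counts.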

The more aggressive correlated branch prediction schemes pro- 
duced more substreams under the assumption that this aggressive 
subdividing would produce more predictable streams. As shown 
by the prediction accuracies of the schemes in Figure 9, this deci- 
sion can lead (though it does not always) to a design with a worse 
prediction accuracy. 

If we remove aliasing from the experiment in Figure 9, an
unaliased per-branch global-pattern branch prediction scheme
should achieve a higher prediction accuracy than either GAs or
gshare. To verify this, we modified our hardware simulation so 
that each per-branch, global pattern stream was assigned its own 
counter, then recorded how many executions led to destructive and 
constructive aliasing. Figure 10 presents the results of this experi- 
ment for all benchmarks. Clearly, aliasing happens regularly, and it 
happens destructively. There is often a significant improvement in 
the prediction accuracy for removing aliasing effects. Better 
dynamic prediction schemes are theoretically possible if those 
schemes can exploit the same pattern and address information as 
gshare without suffering destructive aliasing effects.

Figure 9. Mispredict rates for the per-branch 2-bit counter (12 bits of
branch address), GAs (6 bits of branch address concatenated with 6 bits
of branch history), and gshare (12 bits of branch address exclusive-ored
with 12 bits of branch history) schemes; each using a table of 4096 2-bit
counters.

Figure 10. Comparing the mispredict rates of correlated branch prediction
schemes that contain aliasing and a branch prediction scheme with a
true per-branch, global-pattern divider. All of the schemes use 2-bit
counters and a history depth of 12 branches. The GAs and gshare
schemes use 4096 counters; the pattern history scheme uses one counter 
per stream. 
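
The effect of giving every per-branch, global-pattern stream its own counter can be sketched with a toy simulation. The code below is an illustration under stated assumptions, not the simulator used for Figure 10: the table and history are shrunk to one bit each so that a collision appears in a very short trace, the counters start at weakly taken, and the names `simulate`, `tiny_gshare`, and `unaliased` are hypothetical.

```python
from collections import defaultdict


def simulate(trace, index_fn, history_bits):
    """Count mispredicts of 2-bit counters selected by index_fn(pc, history)."""
    counters = defaultdict(lambda: 2)          # 2 = weakly taken (assumed start)
    history = 0
    mask = (1 << history_bits) - 1
    mispredicts = 0
    for pc, taken in trace:
        key = index_fn(pc, history)
        predicted_taken = counters[key] >= 2
        if predicted_taken != taken:
            mispredicts += 1
        # Saturating update of the selected counter.
        if taken:
            counters[key] = min(counters[key] + 1, 3)
        else:
            counters[key] = max(counters[key] - 1, 0)
        history = ((history << 1) | int(taken)) & mask
    return mispredicts


def tiny_gshare(pc, history):
    """Toy gshare with a 2-entry table: address XOR history, one bit."""
    return (pc ^ history) & 1


def unaliased(pc, history):
    """One counter per (branch, pattern) stream: no two streams share state."""
    return (pc, history)


# Branch 0 is always taken, branch 1 is never taken; they alternate.
trace = [(0, True), (1, False)] * 10
aliased_misses = simulate(trace, tiny_gshare, history_bits=1)
unaliased_misses = simulate(trace, unaliased, history_bits=1)
```

In this trace both branches map to the same toy gshare counter, so the two perfectly biased streams keep undoing each other's training: `aliased_misses` is 10, while `unaliased_misses` is 1 (the single start-up mispredict of the never-taken stream).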

The grep.a bar in Figure 10 is the exception to the trend: the GAs
scheme shows higher prediction accuracy than the unaliased pat- 
tern divider. From Table 1, one can see that grep.a executes very 
few branches. The worse prediction accuracy seems to be a result 
of the start-up costs of training a 2-bit counter to match a stream's 
bias. Since the unaliased divider produces more streams than the 
GAs divider, the unaliased divider pays a larger training cost. This 
larger training cost is significant on short benchmark runs; it might 
be reduced if schemes that use dynamic predictors could merge 
streams with similar initial values. 

Once we have removed the effects due to aliasing, we are in a position
to evaluate the benefit of path history over pattern history in
dynamic schemes. We extended our simulator to use an unaliased 
path-history divider with dynamic predictors. The mispredict rates 
for this path-based predictor are presented in Figure 11. Using 
paths improves the mispredict rate on the majority of our bench- 
marks. As in the grep.a case from Figure 10, a few of the short 
benchmarks exhibit worse prediction accuracy under path rather 
than pattern history due to start-up training costs. Since the magnitude
of the benefit from a path-based divider is sometimes small,
designers must take care that improvements in prediction accuracy 
due to path history are not swamped by aliasing penalties intro- 
duced as part of the modified scheme. 
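
The difference between the two kinds of history can be illustrated with a small example. This is a sketch under stated assumptions: a path is represented as the sequence of (branch address, direction) pairs, each stream is predicted by its majority direction rather than by a 2-bit counter to keep the arithmetic simple, and the addresses and helper names are hypothetical.

```python
from collections import defaultdict, deque


def split_streams(trace, depth, use_path):
    """Divide a trace of (pc, taken) pairs into per-branch substreams.

    Pattern history keys a stream by the last `depth` directions only;
    path history keys it by the last `depth` (pc, direction) pairs.
    """
    streams = defaultdict(list)
    recent = deque(maxlen=depth)
    for pc, taken in trace:
        streams[(pc, tuple(recent))].append(taken)
        recent.append((pc, taken) if use_path else taken)
    return streams


def static_mispredicts(streams):
    """Mispredicts if every stream is predicted in its majority direction."""
    return sum(min(s.count(True), s.count(False)) for s in streams.values())


A, B, X = 0x10, 0x20, 0x30                     # hypothetical branch addresses
# X is taken when reached through A and not taken when reached through B.
trace = [(A, True), (X, True), (B, True), (X, False)] * 5
pattern_misses = static_mispredicts(split_streams(trace, 1, use_path=False))
path_misses = static_mispredicts(split_streams(trace, 1, use_path=True))
```

Both predecessors of X are taken, so pattern history of depth 1 sees the same pattern on either path and merges the two behaviors of X into one unbiased stream (`pattern_misses` is 5). Path history keeps them apart (`path_misses` is 0).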

4.3 Cross-Procedure Correlation

So far, the differences we have explored between static and 
dynamic correlated branch prediction schemes only hurt the pre- 
diction accuracy of the dynamic schemes. Yet the overall predic- 
tion accuracy of the dynamic schemes is often better. To explain 
this disparity, we collected statistics of cases where the hardware 
prediction schemes achieved better per-branch accuracy and then 
examined the kinds of correlation that occurred. The vast majority 
of such cases turned out to be cross-procedure correlation: 
branches that occurred just after a procedure entry or just after a 
procedure return. 

Our sebp scheme [17] cannot preserve correlation information 
across procedure calls. The scheme encodes correlation history 
into the program counter by duplicating basic blocks. A particular 
copy of a basic block implies some set of previous execution paths. 
The problem is that the value of the program counter is effectively
reset on a procedure call or return, eliminating correlation information
across procedure calls. In terms of the framework, this means
that the static scheme's divider is not always capable of using all of
the components of the path history vector; the portion of the paths
in the vector before a call boundary is merged into a single path
(see footnote 5). In the extreme, a branch just after a call or return
will have no history information available. In contrast, hardware
schemes ignore procedure call boundaries, since they record conditional
branch directions in additional hardware state.

Figure 11. Mispredict rates of schemes using 2-bit counters, a history
depth of 12 branches, and a divider without aliasing effects (i.e., one 2-bit
counter per stream). "Unaliased pattern" and "unaliased path" depict pattern
and path history dividers, respectively.

Some examples of cross-procedure correlation are obvious once 
they are pointed out: 

• The eqntott benchmark in the SPECint92 suite uses a quick- 
sort routine to sort bit vectors. A variety of different generic 
bit-vector comparison functions are passed to qst(). Each of 
these compare routines branches to different return points 
corresponding to equal, less than, or greater than return values;
qst() then immediately branches based on the return values.
The branch that tests the return value is completely
determined by the branch that set the return value.

• The garbage collector's mark() function in the xlisp benchmark
calls livecar() to determine when to follow a node's left
sublist. The switch statement inside livecar() returns the constant
FALSE in many cases; this FALSE return value is then
immediately checked by mark().
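
Both examples share one shape: a branch inside the callee fixes the return value, and the caller tests that value immediately after the return. The sketch below models this with a synthetic trace and compares a divider whose history is cleared at every call and return against one that ignores those boundaries. The trace format, the addresses, the use of pattern history of depth 1, and the majority-direction predictor are all assumptions made for illustration.

```python
from collections import defaultdict, deque

CALLEE_BRANCH, CALLER_BRANCH = 0x100, 0x200    # hypothetical addresses


def make_trace(return_values):
    """Each call: callee branch picks the value, caller branch tests it."""
    trace = []
    for value in return_values:
        trace.append(("call", None, None))
        trace.append(("branch", CALLEE_BRANCH, value))
        trace.append(("ret", None, None))
        trace.append(("branch", CALLER_BRANCH, value))
    return trace


def static_mispredicts(trace, pc_of_interest, depth, reset_at_calls):
    """Majority-direction mispredicts for one branch, split by history."""
    streams = defaultdict(list)
    recent = deque(maxlen=depth)
    for kind, pc, taken in trace:
        if kind != "branch":
            if reset_at_calls:
                recent.clear()                 # history is lost at the boundary
            continue
        if pc == pc_of_interest:
            streams[tuple(recent)].append(taken)
        recent.append(taken)
    return sum(min(s.count(True), s.count(False)) for s in streams.values())


values = [True, True, False, True, False, False, True, False]
trace = make_trace(values)
intra_misses = static_mispredicts(trace, CALLER_BRANCH, 1, reset_at_calls=True)
cross_misses = static_mispredicts(trace, CALLER_BRANCH, 1, reset_at_calls=False)
```

With the history cleared at the return, the caller's branch has no history and forms one unbiased stream (`intra_misses` is 4 of 8 executions). With the history preserved, the callee's branch direction fully determines the caller's branch (`cross_misses` is 0).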

These kinds of cross-procedure correlation led us to ask how accu- 
rately a static prediction scheme could predict if it were possible to 
preserve path information across procedure boundaries. We modi- 
fied our trace and simulation environment to record paths across 
procedure call boundaries, and to simulate the prediction accuracy 
that would be obtained if a code transformation could preserve all 
desired correlation information across calls. The prediction accu- 
racy results where we trained and tested on the same data set are 
summarized in Figure 12. In these results, compress shows very
little benefit from cross-procedure correlation, but this makes
sense because compress is implemented as one large loop in a single
procedure. In some benchmarks, like eqntott and awk, more
than half of the mispredictions were removed. Other benchmarks
showed more modest improvements.

Footnote 5: Merging is not always harmful. As part of our sebp algorithm,
we perform per-branch analysis that intentionally merges path streams
with the same majority direction. There is no penalty for this kind of
merging when using a static predictor; sebp exploits this harmless
merging to reduce overall code expansion.




Figure 12. Mispredict rates of sebp using path history (same as the third 
series in Figure 6) and simulated mispredict rates for cross-call pattern 
and cross-call path history dividers with static predictors. These results 
use a history depth of 12 and train and test on the same dataset. 



The results in Figure 12 are not necessarily what we would expect 
from actual implementations of cross-call correlated static 
schemes, because they train and test on the same data set. This 
gives the best possible static prediction accuracy, rather than what
would occur if different training and testing data sets were used. 
However, these results show that using cross-call correlation we 
can achieve better static prediction accuracy than was previously 
believed possible. 

Having discussed the implementation differences in dividers, we 
can now revisit the effect of correlation on bias that we began to 
explore in Section 3. Figure 13 extends the results shown in Figure 
4, adding a new series of columns that shows the bias of streams 
generated by an unaliased, cross-call, path divider. The improved 
divider further steepens the U-shaped distribution of bias. 

Figure 13. Histogram of branch bias, extending Figure 4. The "Cross-Call
Path History" bars show the bias of streams generated by an
unaliased, cross-call, path divider. These results use a history depth of 12
and train and test on the same dataset.



Exploiting Cross-Procedure Correlation Statically 

We have not yet found a simple code transformation that can gen- 
erally preserve correlation across calls. However, a number of 
techniques may be useful: selective inlining [6], template forma- 
tion, and multiple entry points [1]. 

Fisher and Freudenberger point out that sophisticated ILP compilers
already expect to perform aggressive inlining [5]. Inlining all
procedures is impractical, since it is exponential in the depth and 
degree of the program call graph. But since a small number of pro- 
cedures make up the majority of program execution cycles [3], it is 
also likely that a small number of procedures are the best candi- 
dates for inlining to extract correlation. The livecar() routine in 
xlisp is a great candidate for inlining: it is called in just one place, 
and it is defined to be local to the xldmem.c source file. After inlining
livecar(), an optimizing compiler could fold the logically correlated
branches into a single branch, decreasing the number of
static and dynamic branches in the program, and reducing cycle 
count. 

The eqntott case, above, is more complicated. Branches in qst() 
correlate into the generic comparison routine that is passed as a 
function pointer. It is not possible for a compiler to simply inline 
the comparison routine. However, it would be possible for a com- 
piler or programmer to build different versions of qst(), as if qst() 
were a C++ style template function that was instantiated for each 
comparison function. Since C functions are not first-class types, 
we could perform function variable propagation analysis to deter- 
mine all of the possible comparison functions. In fact, in eqntott, 
the comparison functions are constants passed in each call of qst(), 
so we could curry (specialize) the qst() call at compile time into a 
call to the appropriate version of qst(). 

We can preserve some correlation state across procedure calls by 
making multiple copies of procedure entry points, one for each rel- 
evant past execution history. This allows us to better predict callee 
branches that correlate back to the caller, but does not help us with 
the more common case of caller branches that correlate into some 
utility function. 

4.4 Adaptability 

The fundamental difference between the static and dynamic corre- 
lated schemes is the predictors they use. Dynamic predictors can 
adapt to track streams during an invocation of the program, while 
static predictors cannot. This raises the question of whether some 
streams require the adaptivity of a dynamic predictor to achieve 
good prediction accuracy. To examine this question, we used the
same approach as in the previous subsections: subtract out the differences,
and see what results. Once again, we used a divider with a
path history of length 12 and no aliasing effects. We also made the
divider ignore procedure call boundaries, like the divider in a
hardware implementation.

We classified streams from the divider as "Static Better", "Equal", 
or "Dynamic Better", depending on whether a static predictor, nei- 
ther predictor, or a 2-bit counter best predicted the stream. Figure 
14 shows the distribution of streams for each benchmark and data 
set. The "Static Better" bars show the percentage of streams
which were better predicted by a perfectly trained static predictor; 
the "Static +1" bars show the percentage of streams where the 
static predictor predicted correctly just one more time than the 2-bit
counter. The large number of "Static +1" streams arises because these
streams have a majority fall-through direction: since our simulation
initializes 2-bit counters to predict weakly taken, the 2-bit counter
incurs a mispredict on the first execution in those strongly biased
streams. The "Dynamic +1" streams are very rare, and there is a small
but visible number of "Dynamic Better" streams.
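
The classification can be sketched as follows. The category names and the weakly-taken initial counter value come from the text; the function names, and the choice to model a perfectly trained static predictor as one that always predicts the stream's majority direction, are assumptions made for illustration.

```python
def counter_correct(outcomes, initial=2):
    """Correct predictions of a 2-bit counter that starts weakly taken."""
    counter, correct = initial, 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct


def static_correct(outcomes):
    """Correct predictions of a static predictor fixed to the majority direction."""
    taken = sum(outcomes)
    return max(taken, len(outcomes) - taken)


def classify(outcomes):
    """Place one stream (a list of booleans) into a Figure 14 category."""
    diff = counter_correct(outcomes) - static_correct(outcomes)
    if diff < -1:
        return "Static Better"
    if diff == -1:
        return "Static +1"
    if diff == 0:
        return "Equal"
    if diff == 1:
        return "Dynamic +1"
    return "Dynamic Better"
```

For example, `classify([False] * 5)` gives "Static +1", because the counter's only error is the start-up mispredict described above, while a stream with long runs in each direction, such as `[True] * 10 + [False] * 10`, gives "Dynamic Better".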

The absolute number of "Dynamic Better" streams is less than 
1,000 for all benchmarks except espresso. This suggests that there
are ways to build better hybrid static/dynamic prediction schemes
than that proposed by Chang et al. [4]. Their scheme assigns all
branches with low bias to dynamic predictors. If we can assign 
only the rare adaptive streams (which might be aliased together or 
aliased with statically predictable streams in Chang et al.'s 
scheme) to their own predictors, while using static predictors for 
the remaining branches, we should be able to achieve even better 
prediction accuracy with fewer counters than previous hybrid 
schemes. 

Despite the small percentage of "Dynamic Better" streams in Fig- 
ure 14, those streams are an important component of overall pre- 
diction accuracy. Figure 15 gives details about the comp.in bar 
from Figure 14, plotting the difference in correct predictions. Even 
though the number of "Dynamic Better" streams is small, the 
"Dynamic Better" tail is significantly larger than the "Static Bet- 
ter" tail. The integral over the tails gives the differences in correct 
predictions between schemes using only static predictors and 
schemes using only dynamic predictors. The "Static Predictor" and
"2-bit Counter" bars of Figure 16 compare the mispredict rates of
such schemes. Even though we exaggerated the benefit with a 
static predictor by assuming perfect training, Figure 16 shows that 
the number of dynamic branches that occur in streams with long 
runs of the minority branch direction is significant — ignoring them 
will affect prediction accuracy. However, since the number of 
static streams requiring an adaptive predictor is very small, the 
possibility exists for a compiler to selectively apply techniques 
like predication [8] to these few streams. The vast majority of 
streams can be handled using simple static branch prediction tech- 
niques. 

Figure 14. Distribution of streams under an unaliased, cross-call, path
divider, depending on whether the streams were predicted better by a
perfectly trained static predictor or by a 2-bit counter. The "+1" categories
contain streams where one of the predictor types correctly predicted just
one more execution than the other predictor type. The "Better" categories
contain streams where one predictor correctly predicts more than one
more execution than the other predictor.



Figure 15. Detailed information on the comp.in bar from Figure 14. The
horizontal axis shows the difference in correct predictions by the static
and 2-bit counter predictors. Positive values correspond to the 2-bit
counter predicting more accurately; negative values correspond to the
static predictor predicting more accurately.

Figure 16. Mispredict rates under an unaliased, cross-call, path divider,
comparing assigning all streams to perfectly trained static predictors
("Static Predictor"), all streams to 2-bit counters ("2-bit Counter"), and
the best predictor for an entire stream to that stream ("Best Predictor per
Stream").

Hybrid prediction schemes can mix static and dynamic predictors
in one scheme. The "Best Predictor per Stream" bars of Figure 16 show the
mispredict rate as if the best predictor (2-bit counter or static) for a
stream was assigned on a per-stream basis, instead of assigning all
streams to a single kind of predictor. These bars show that schemes
using a mix of static and dynamic predictors can achieve very high
prediction accuracies.
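
A small worked example shows why assigning the best predictor per stream can beat either pure scheme. The three streams below are invented for illustration and are not taken from any benchmark. The helper names are hypothetical, and the static predictor is again assumed to be perfectly trained on the stream's majority direction.

```python
def counter_misses(outcomes, initial=2):
    """Mispredicts of a 2-bit counter that starts weakly taken."""
    counter, misses = initial, 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


def static_misses(outcomes):
    """Mispredicts of a static predictor fixed to the majority direction."""
    taken = sum(outcomes)
    return min(taken, len(outcomes) - taken)


streams = {
    "biased_not_taken": [False] * 8,                # static wins by one
    "long_runs": [True] * 10 + [False] * 10,        # needs adaptivity
    "alternating": [False, True] * 3,               # counter never settles
}

static_total = sum(static_misses(s) for s in streams.values())
dynamic_total = sum(counter_misses(s) for s in streams.values())
best_total = sum(min(static_misses(s), counter_misses(s))
                 for s in streams.values())
```

Here the all-static assignment incurs 13 mispredicts, the all-dynamic assignment 9, and the per-stream choice 5, because only the stream with long runs of each direction benefits from a counter.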

In the long term, adaptability may be the only thing that separates 
dynamic and static schemes, since static schemes can take cross- 
call correlation into account, and dynamic schemes can exploit 
path history and may be able to reduce aliasing problems. Correla- 
tion provides a useful tool to reduce the amount of adaptivity (both 
in dynamic branches and stream distribution) in a program, but no 
current methods allow us to completely eliminate the need for 
adaptivity. Hybrid schemes that use the techniques explored in this 
paper may be able to find efficient ways to separate and handle
adaptive streams.

5 Conclusions and Future Work



7 References 



Before one can build better branch prediction schemes, one must 
understand how and why existing schemes work. We presented a 
framework for analyzing and categorizing branch prediction 
schemes. The framework partitions schemes into two major parts: 
a divider and predictors. Dividers attempt to partition the program 
execution stream into substreams that are individually more pre- 
dictable than the original stream. All known branch prediction 
schemes fit into this framework. The framework provided the 
motivation for all of the studies in this paper, allowing us to practi- 
cally and systematically analyze the differences between schemes. 

Profiled per-branch static branch prediction works because pro- 
grams have a large percentage of branches that are strongly biased. 
Correlation changes the distribution of streams to increase the per- 
centage of branches that are strongly biased. Correlation reduces 
the diversity of branch streams, making profiled static correlated 
branch prediction more accurate than profiled per-branch static 
branch prediction. 

Under our framework, state-of-the-art static and dynamic prediction
schemes differ in four major qualities: use of pattern versus
path history, aliasing effects, ability to exploit cross-procedure cor- 
relation, and adaptivity. 

• Path history is slightly better than pattern history in exploiting 
branch correlation. 

• Correlated dynamic branch prediction schemes utilize more 
2-bit counters in their tables, but simultaneously increase the 
amount of aliasing. Removing the effect of aliasing increases 
prediction accuracy, suggesting that work should be done to 
limit aliasing in dynamic branch prediction schemes. 

• The inability to exploit cross-procedure correlation limits the
accuracy of static branch prediction schemes. We showed some
large potential benefits to cross-procedure correlation in static
schemes. We are pursuing several practical techniques that allow
static schemes to exploit cross-procedure correlation.

• The percentage of adaptive streams is small, but the number of
dynamic branches executed in adaptive streams is significant.

We have not reached the limits of existing basic branch prediction 
schemes. We have demonstrated potential for increased prediction 
accuracy in each of the areas above. Dynamic branch prediction 
schemes will benefit from methods to control aliasing and to 
exploit path history. Static branch prediction schemes will benefit 
from techniques that exploit cross-procedure correlation and 
reduce the need for adaptive predictors. 